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This paper explores a class of empirical Bayes methods for level- 
dependent threshold selection in wavelet shrinkage. The prior consid- 
ered for each wavelet coefficient is a mixture of an atom of probabil- 
ity at zero and a heavy-tailed density. The mixing weight, or sparsity 
parameter, for each level of the transform is chosen by marginal max- 
imum likelihood. If estimation is carried out using the posterior me- 
dian, this is a random thresholding procedure; the estimation can also 
be carried out using other thresholding rules with the same threshold. 
Details of the calculations needed for implementing the procedure are 
included. In practice, the estimates are quick to compute and there is 
software available. Simulations on the standard model functions show 
excellent performance, and applications to data drawn from various 
fields of application are used to explore the practical performance of 
the approach. 

By using a general result on the risk of the corresponding marginal 
maximum likelihood approach for a single sequence, overall bounds 
on the risk of the method are found subject to membership of the 
unknown function in one of a wide range of Besov classes, covering 
also the case of / of bounded variation. The rates obtained are op- 
timal for any value of the parameter p in (0,oo], simultaneously for 
a wide range of loss functions, each dominating the norm of the 
crth derivative, with a > and < q < 2. 

Attention is paid to the distinction between sampling the unknown 
function within white noise and sampling at discrete points, and be- 
tween placing constraints on the function itself and on the discrete 
wavelet transform of its sequence of values at the observation points. 
Results for all relevant combinations of these scenarios are obtained. 
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In some cases a key feature of the theory is a particular boundary- 
corrected wavelet basis, details of which are discussed. 

Overall, the approach described seems so far unique in combin- 
ing the properties of fast computation, good theoretical properties 
and good performance in simulations and in practice. A key feature 
appears to be that the estimate of sparsity adapts to three differ- 
ent zones of estimation, first where the signal is not sparse enough 
for thresholding to be of benefit, second where an appropriately cho- 
sen threshold results in substantially improved estimation, and third 
where the signal is so sparse that the zero estimate gives the optimum 
accuracy rate. 

1. Introduction. 

1.1. Background. Consider the nonparametric regression problem where 
we have observations at 2"^ regularly spaced points ti of some unknown 
function / subject to noise 



where the Si are independent A^(0,cj|;) random variables. The standard 
wavelet-based approaches to the estimation of / proceed by taking the dis- 
crete wavelet transform of the data Xi, processing the resulting coefficients 
to remove noise, usually by some form of thresholding, and then transform- 
ing back to obtain the estimate. 

The underlying notion behind wavelet methods is that the unknown func- 
tion has an economical wavelet expression, in that / is, or is well approxi- 
mated by, a function with a relatively small proportion of nonzero wavelet 
coefficients. The quality of estimation is quite sensitive to the choice of 
threshold, with the best choice being dependent on the problem setting. In 
general terms, "sparse" signals call for relatively high thresholds {3aE, ^cfe 
or even higher), while "dense" signals might demand choices of 2(Je or even 
lower. Indeed, it is typical that the wavelet coefficients of a true signal will 
be relatively more sparse at the fine resolution scales than at the coarser 
scales. It is therefore desirable to develop threshold selection methods that 
adapt the threshold level by level. 

One would hope that such methods would estimate thresholds that sta- 
bly reflect the gradation from sparse to dense signals as the scale changes 
from fine to coarse. It has proven elusive to construct threshold selectors that 
combine properties such as these with good theoretical properties. The prin- 
cipal motivation for the work reported in the present paper is to show that 
a simple empirical Bayesian approach combines computational tractability 
with good theoretical and practical performance. For software availability, 
see Section 1.8. 



(1) 
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While the present paper is concerned with the nonparametric regression 
model (1) and wavelet transforms, the same levelwise empirical Bayes ap- 
proach is, in principle, directly applicable to other direct and indirect trans- 
form shrinkage settings with multiscale block structure, as briefly discussed 
in [28]. 

1.2. Bayesian approaches to wavelet regression. Within a Bayesian con- 
text, the notion of sparsity is naturally modeled by a suitable prior distri- 
bution for the wavelet coefficients of /. Write dji. for the elements of the 
discrete wavelet transform (DWT) of the vector of values f{ti) and d*f^ for 

the DWT of the observed data Xi. Let N = 2'^ and 9jk = N'^/^ djk- 

Clyde, Parmigiani and Vidakovic [13], Abramovich, Sapatinas and Silver- 
man [4] and Silverman [46] have considered a particular mixture prior for 
this problem. Under this prior, the djfc are independently distributed with 

(2) djfc~(l-7r,)5(0)+7rj7V(0,r|), 

a mixture of an atom of probability at zero and a normal distribution with 
variance tJ. The parameters of the distribution (2) depend on the level j of 
the coefficient in the transform. A related prior was considered by Chipman, 
Kolaczyk and McCulloch [11]; for a survey of work in this area, see [48]. See 
also [12, 38, 44, 50, 51] for a range of approaches to the modeling of the 
wavelet coefficients underlying a function or image. [31] is an early version 
introducing the approach of the present paper. 

The most popular summary of the posterior distribution under the model (2) 
has been the posterior mean, but Abramovich, Sapatinas and Silverman [4] 
investigated the use of the posterior median of djk as a summary of the 
posterior distribution. This is a true thresholding rule, in that for less 
than some threshold, the point estimate of djk will be exactly zero. In the 
wavelet context, the coefficient- wise posterior median corresponds to a point 
estimate of the posterior distribution under a family of loss function equiva- 
lent to norms on the function and its derivatives. Such losses are in any 
case more natural if one wishes to allow for the possibility of inhomogeneous 
functions, one of the aims of the wavelet approach. 

1.3. Choosing the parameters in the prior. How should the parameters 
in the prior be chosen? In much of the existing literature, the parameters 
are either chosen directly by reference to prior information about /, or by a 
combination of prior information and data-based criteria. Though some of 
these, for example, the BayesThresh approach of Abramovich, Sapatinas and 
Silverman [4], give good results, they clearly invite the possibility of a more 
systematic approach to the choice of the hyperparameters. In the present pa- 
per we take an empirical Bayes (or marginal maximum likelihood) approach. 
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which yields a completely data-based method of choosing the prior parame- 
ters. Within the Bayesian formulation set out above, wavelet regression at a 
single resolution level j is a special case of a single sequence Bayesian model 
selection problem considered, among others, by George and Foster [24, 25]. 
This problem is considered in detail by Johnstone and Silverman [33]; we 
review the basic method presented there and also give some additional im- 
plementational details. 

Suppose that Z = (Zi, . . . , Z„) are observations satisfying 

(3) Zi = fii + ei, 

where the £i are independent A^(0, 1) random variables. It is supposed that 
the unknown coefficients /ij are mostly zero, but some of them may be 
nonzero, and, with this in mind, it is of interest to estimate the fii on the 
basis of the observed data. In the model selection context, the nonzero Hi 
correspond to parameters that actually enter the model. The connection 
with wavelet regression is natural: the Zi might be the sample wavelet coef- 
ficients (suitably renormalized) at a particular level, and these are noisy ob- 
servations of a sequence of population wavelet coefficients which are mostly 
zero. 

The parameters Hi are modeled as having independent prior distributions 
each given by the mixture 



The nonzero part of the prior, 7, is assumed to be a fixed unimodal symmet- 
ric density. In most of the previous wavelet work cited above, the density 7 
is a normal density, but we use a heavier-tailed prior, replacing the A^(0, tJ) 
part of the mixture (2) by, for example, a double exponential distribution 
with a scale parameter that may depend on the level of the coefficient in 
the transform. Another possible prior, with still heavier tails, is introduced 
in Section 2. Apart from the theoretical advantages of such an approach, 
Wainwright, Simoncelli and Willsky [50] argue that the marginal distribu- 
tion of the wavelet coefficients of images arising in practice typically has tails 
heavier than Gaussian. In the Bayesian setup, the noise (ej) is independent 
of the wavelet coefficients. 

Let g = -y -kip, where -k denotes convolution. To avoid confusion with the 
scaling function of the wavelet family, we use 99 to denote the standard 
normal density. The marginal density of the observations Z^ will then be 



We define the marginal maximum likelihood estimator w of w to be the 
maximizer of the marginal log-likelihood 



(4) 



/prior(^) = (1 - w)6o{lJ,) + wyifl). 



(1 — 'w)ip{z) + wg{z). 



n 



(5) 



i{w) = J2 log{(l - wMZ^) + wg{Z,)} 



i=l 
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subject to the constraint on w that the threshold satisfies t{w) < \/2 logn. 
This upper hmit on the threshold is the universal threshold, which has the 
property that it is asymptotically the largest absolute value for observa- 
tions obtained from a zero signal, and can therefore be considered to be the 
appropriate limiting threshold as zz; — > 0. 

Our basic approach is then to plug the value w back into the prior and then 
estimate the parameters /ij by a Bayesian procedure using this value of w. 
Suppose fi has prior (4) and that we observe Z ~ N{fi, 1). Let //(z; w) be the 
median of the posterior distribution of ^ given Z = z and jl{z; w) its mean. 
If the posterior median is used, then /ij will be estimated by fii = fi{Zi,w), 
while the corresponding estimate using the posterior mean is fii = fi{Zi;w). 

For fixed w <1, the function /t(z; w) will be a monotonic function of z with 
the thresholding property, in that there exists t{w) > such that w) = 
if and only if \z\ < t^w). The estimated weight w thus yields an estimated 
threshold t{w) = t, say. A simple extension of the method is to retain the 
threshold t but to use a more general thresholding rule, for example, hard 
or soft thresholding. The main emphasis of this paper is on the choice of the 
threshold, rather than on the choice between different thresholding rules. 

The posterior mean rule fl{z;w) fails to have the thresholding property, 
and, hence, produces estimates in which, essentially, all the coefficients are 
nonzero. Nevertheless, it has shrinkage properties that allow it to give good 
results. We shall see that, both in theory and in simulation studies, the 
performance of the posterior mean is good, but not quite as good as the 
posterior median. 

The same approach can be used to estimate other parameters of the prior. 
In particular, if a scale parameter a is incorporated by considering a prior 
density (1 — w)6o{iJ,) + wa'y{afj:), define ga to be the convolution of aj{a-) 
with the normal density. Then both a and w can be estimated by finding 
the maximum over both parameters of 

n 

i{w,a)=Y,\og{il-w)ipiZi)+wga{Zi)}. 

1=1 

In the case where there is no scale parameter to be estimated, £'{w) is 
a monotonic function oi w, so its root is very easily found numerically, 
provided the function g is tractable. If one is maximizing over both w and 
a, then a package numerical maximization routine that uses gradients has 
been found to be an acceptably efficient way of maximizing i{w,a). 

Details of relevant calculations for some particular priors are given in 
Section 2.2. All these calculations are implemented in the authors' package, 
Johnstone and Silverman [34] , and the documentation of that package gives 
further details beyond those given in this paper. 
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1.4. Marginal maximum likelihood in the wavelet context. In the wavelet 
context, the MML approach is applied to each level of the wavelet transform 
separately, to yield values of w and, if appropriate, a that depend on the 
level of the transform. Let (t| be the standard deviation of the noise at 
level j. Assuming that the original noise is independent, the variance (t| 
will be the same for all j and can, as is conventional, be estimated from the 
median of the absolute values of the coefficients at the highest level. More 
generally, for example, in the case of stationary correlated noise, it may be 
appropriate to estimate aj separately for each level, at least at the higher 
levels of the transform. In this paper we have not considered the effect of 
sampling variability in the estimation of the noise variance, but that would 
be an interesting topic for future research. 

At level J, define the sequence = d*i^/crj, and apply the single sequence 
MML approach to this sequence to obtain wj and, if appropriate, estimates 
of any other parameters of the prior. The estimated wavelet coefficients of 
the discrete wavelet transform of the sequence /(tj) are then given by 

(6) djk = ajfi{d*i,/aj;wj). 

Assuming, without loss of generality, that the function / is defined on the 
interval [0, 1] and the values tj = i/N, crude estimates of the wavelet coeffi- 
cients of the function / are then 9jk = N~^^'^djk, neglecting boundary issues 
for the moment. 



Straightforward generalizations. Natural generalizations of (6) include 
the inclusion of estimates of other parameters in the prior, as well as the 
use of the posterior mean instead of the posterior median, or the use of a 
more general thresholding rule than the posterior median, but still using 
the posterior median threshold t{w). In addition, we consider two further 
generalizations. 

Modified thresholds for the estimation of derivatives. When wavelet meth- 
ods are used to estimate derivatives, it was shown by Abramovich and Sil- 
verman [5] that the appropriate universal threshold is not \/21ogn, but is 
a multiple of this quantity. We develop theory below using, for the estima- 
tion of derivatives, a modified threshold t^iw) given, for some appropriately 
chosen ^ > 0, by 

(7) f .(',„'] = f*^^ if <21ogn-51oglogn, 

^' 1 V2(l + ^)logn, otherwise. 

The translation-invariant wavelet transform. It is by now well recognized 
that the translation-invariant wavelet transform [15], in general, gives much 
better results than the conventional transform applied with a fixed origin. 
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At each level j, the translation-invariant transform gives a sequence of 2 
values that are not actually independent. Each subsequence obtained by 
regular selection at intervals 2'^~^ will be independent, and corresponds to 
the coefficients at level j of the standard wavelet expansion with a particular 
choice of origin. 

One way of proceeding would be to apply the empirical Bayes method 
entirely separately for each of these subsequences to obtain estimates of 
the relevant coefficients in the translation-invariant wavelet transform. It is 
simpler and more natural, however, to use the same estimates of the mixture 
hyperparameters for every position of the time origin, thereby borrowing 
strength in the estimation of the hyperparameters between the different 
positions of the origin. To obtain a single estimate at each level, we maximize 
the average, over choice of origin, of the marginal log-likelihood functions. 
This average is 2~^'^~-'^ times the "as-if- independent" log-likelihood function 
obtained by simply summing the log- likelihoods for each of the 2"^ coefficients 
at level j in the translation-invariant transform. 

The estimates of the mixture parameters are then used to give individ- 
ual posterior medians of each of the coefficients of the translation-invariant 
transform, and the estimated function is found by the average basis ap- 
proach. Apart from the combination of log-likelihoods involved in the esti- 
mation of the hyperparameters, the translation-invariant method gives the 
result of applying the standard method at every possible choice of time 
origin, and then averaging over the position of the time origin. 

Using an as-if-independent likelihood at each level to choose the hyper- 
parameters is reminiscent of the independence estimating equation approach 
of Liang and Zeger [35] to parameter fitting in the marginal distribution 
of a sequence of identically distributed but nonindependent observations. 
Their paper was concerned with observations with generalized linear model 
dependence on the parameters and covariates. Because, for different choices 
of origin, the prior distributions on the coefficients are not, in general, gen- 
erated from a single underlying prior model for the curve, our translation- 
invariant procedure involves a separate modeling of the prior information at 
each origin position, modulo 2'^~^ for the coefficients at level j. Independence 
estimating equations, as we have used them, are a method of combining the 
separate problems of choosing the prior into a single problem at each level. 

1.5. Theoretical approach and results. By now a classic way to study 
the adaptivity of wavelet smoothing methods is through the study of the 
worst behavior of a method when the wavelet coefficients of the function / 
are constrained to lie in a particular Besov sequence space, corresponding 
to Besov function space membership of the function itself. Besov spaces 
are a flexible family that, depending on their parameters, can allow for 
varying degrees of inhomogeneity, as well as smoothness in the functions 
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that they contain. Some relations between Besov spaces and spaces defined 
by Lp norms on function and their derivatives are reviewed in Section 5.6. 
We shall show that the empirical Bayes method with a suitable function 7 
automatically achieves the best possible minimax rate over a wide range of 
Besov spaces, including those with very low values of the parameter p that 
allows for inhomogeneity in the unknown function /. 

A particular case of the theory we develop is as follows; fuller details 
of the assumptions will be given later in the paper. Suppose that we have 
observations Xi = f{ti) + Ei of a function f at N regularly spaced points 
ti, with Si independent N{0,a'^) random variables. Let djk = N^^'^Ojk be 
the coefficients of an orthogonal discrete wavelet transform of the sequence 
f{ti), and let dj denote the vector with elements djk as k varies. 

Assume that the coarsest level to which the wavelet transform is carried 
out is a fixed level L > 0. Denote by di-i the vector of scaling coefficient(s) 
at this level. If periodic boundary conditions are being used and iV is a 
power of 2, the vector dj is of length 2^ ii j > L and 2^ \i j = L — 1, and 
N = 2"^ , where J — 1 is the finest level at which the sample coefficients are 
defined. 

To allow for discrete wavelet transforms based on other boundary condi- 
tions and with values of that are other suitable multiples of powers of 2, 
we shall make the milder assumptions that dj is defined for L — 1 < j < J, 
with L fixed and J — > 00 as N ^ 00, that the sum of the lengths of the dj 
is equal to A^, and that the length of each dj for j > L is in the interval 
[2-'"^, 2-^]. The length of the vector di^i of scaling coefficients is assumed to 
lie in [2^-\2^], so that 2-^~^ <N <2-^. 

Estimate the coefficients djk for j > L by the estimate in (6), applying 
an empirical Bayes approach level by level, based on a mixture prior with a 
heavy-tailed nonzero component 7. The estimator can be either the poste- 
rior median or some other thresholding rule using the same threshold (and 
obeying a bounded shrinkage condition set out later). The scaling coefficients 
di-i are estimated by their observed values obtain the estimates 

f{ti) of the function values f{ti), apply the inverse discrete wavelet trans- 
form to the estimated array djk- 

For < p < 00 and a> | — 2, let a = a — i + Define the Besov sequence 
space 6p oo(C') to be the set of all coefficient arrays 9 such that 

(8) J2 l^ifcl^ < CP2-"fJ' for all j with L - 1 < j < J. 

k 

Our theory shows that, for some constant c, possibly depending on p and a 
but not on A^ or C, 

N 

sup N~^EY^{f{U)-f{U)Y 
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< c{C2/(2"+l)Ar-2-/(2a+l) + AT"! (log Ar)4} . 

For fixed C, the second term in the bound (9) is negligible, and the rate 
0(A^~^"/(^"~'"^)) of decay of the mean square error is the best that can be at- 
tained over the relevant function class. The result (9) thus shows that, apart 
from the 0{N~^ log^ N) term, our estimation method simultaneously attains 
the optimum rate over a wide range of function classes, thus automatically 
adapting to the regularity of the underlying function. Under conditions we 
shall discuss, the Besov sequence space norm used in (8) is equivalent to a 
Besov function space norm on / with the same parameters. 

The main theorem of the paper goes considerably beyond (9), in the 
following respects: 

• It demonstrates the optimal rate of convergence for mean (7-norm errors 
for all < g < 2, not just the mean square error considered in (9). 

• Beyond the posterior median, any thresholding method satisfying certain 
mild conditions can be used, and, for 1 < g < 2, the results also hold for 
the posterior mean. 

• If an appropriate modified threshold method is used, the optimality also 
extends to the estimation of derivatives of /. 

Most of the existing statistical wavelet literature concentrates explicitly or 
implicitly on the white noise model, where we assume that we have indepen- 
dent observations of the wavelet coefficients of the function up to some reso- 
lution level. Little attention has been paid to the errors possibly introduced 
by the discretization of /. However, Donoho and Johnstone [20] discuss a 
form of discretization somewhat different from simple sampling at discrete 
points. Another issue not considered in detail in much of the present litera- 
ture is the careful treatment in a statistical context of boundary-corrected 
wavelet methods, such as those introduced by Cohen, Daubechies and Vial 
[14]. In the current paper we do consider the effects of discretization and of 
boundary correction, and we prove theorems for both the white noise model 
and for a sampled data model. 

In particular, suppose that the function / is observed on [0, 1] at a regular 
grid of = 2"^ points, subject to independent N{0,a^) noise. Proceeding 
as above, but with an appropriate preconditioning of the data near the 
boundaries and treatment of the boundary wavelet coefficients, construct an 
estimate of / itself by setting / = J2k ^L~i,k4'Lk + Y.L<j<j Y.k hk^l^jk, where 
and V'jfc are the scaling functions and wavelets at scale j. Let ^{C) be 
the class of functions / whose true wavelet coefficients fall in hp ,^{C). Under 
appropriate mild conditions, a special case of our theory demonstrates that 

(10) sup E /'{/(t)-/(t)}2<cC72/(2a+l)^-2a/{2a+l)^^(^-2a/{2a+l))^ 
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Our results go far beyond mean integrated square error and consider accu- 
racy of estimation in Besov sequence norms on the wavelet coefficients that 
imply good estimation of derivatives, as well as the function itself, and allow 
for losses in g-norms for < < 2. 

1.6. Alternative approaches and related bibliography. Finding a numer- 
ically simple and stable adaptive method for threshold choice with good 
theoretical and practical properties has proven to be elusive. A plethora of 
methods for choosing thresholds has been proposed (see, e.g., [49], Chap- 
ter 6). Apart from empirical Bayes methods, we note two other methods 
which have been accompanied by some theoretical analysis of their prop- 
erties and for which software can easily be written. In both cases we set 
Zk = Xk/aE, so that the thresholds are expressed on a renormalized scale. 

Stein's Unbiased Risk Estimate (SURE) aims to minimize the mean squared 
error of soft thresholding, and is another method intended to be adaptive to 
different levels of sparsity. The threshold isuRE is chosen as the minimizer 
(within the range [0, V^logn]) of 

n n 

(11) ij{t) =n+Y.Zk^t'-^J2 nzi < t'}. 

k=l k=l 

This does, indeed, have some good theoretical properties [19], but the same 
theoretical analysis, combined with simulation and practical experience, 
shows that the method can be unstable [19, 33] and that it does not choose 
thresholds well in sparse cases. 

The False Discovery Rate (FDR) method is derived from the principle 
of controlling the false discovery rate in simultaneous hypothesis testing 
[7] and has been studied in detail in the estimation setting [3]. Order the 
data by decreasing magnitudes: |^|(i) > 1-^1(2) > ••• > l-^l(n)) and compare 
to a quantile boundary: t^ = z{q/2 ■ k/n), where the false discovery rate 
parameter q E (0, ^]. Define a crossing index kp = max{A;: > t^}, and 
use this to set the threshold tp = i^.^- Although FDR threshold selection 
adapts very well to sparse signals [3], it does less well on dense signals of 
moderate size. 

Overall, we shall see that empirical Bayes thresholding has some of the 
good properties of both SURE and FDR thresholding and deals with the 
transition between sparse and dense signals in a stable manner. A detailed 
discussion of theoretical comparisons between the various estimators is pro- 
vided in Section 5.7. 

1.7. Structure of the paper. In Section 2 we discuss various aspects of the 
mixture priors used later in the paper. The priors themselves are specified, 
and details given of formulas needed for the Bayesian calculations in practice. 
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We take the opportunity to give additional practical details not included in 
[33]. In the next two sections the practical performance of the proposed 
method is investigated, by simulation in Section 3, and by applications to 
data sets arising in practice in Section 4. 

Section 5 contains the theoretical core of the paper for estimation of coef- 
ficient arrays under Besov sequence norm constraints. First, a wide-ranging 
result. Theorem 1, for the white noise model is stated. We then explore as- 
pects of the boundary wavelet construction, including ways of mapping data 
to scaling function coefficients at the finest level. This allows for the defi- 
nition of a boundary-corrected empirical Bayes estimator for the sampled 
data problem on a finite interval. The result we state about this estimator, 
Theorem 2, shows that it essentially attains the same performance as the es- 
timator for the white noise case. Finally, the correspondences between Besov 
sequence and function norms are set out, specifically addressing wavelets and 
functions on a bounded interval. For < g < 2, we relate risk measures ex- 
pressed in terms of wavelet coefficients to g-norms of appropriate derivatives. 

Section 6 contains the proofs of the main theorems, starting by reviewing 
theoretical results for the single sequence problem from [33], but cast into 
a form relevant for the present paper. These results are used to prove the 
white noise case theorem. The proof of the theorem for the sampled data case 
also makes use of approximation results for appropriate boundary-corrected 
wavelets given by Johnstone and Silverman [32]. Finally, Section 7 contains 
further technical details and remarks, including a discussion of the impor- 
tance of the bounded shrinkage assumption and results for the posterior 
mean estimator. 

1.8. Software. The methods described in [33] and in the current paper 
have been implemented as the EbayesThresh contributed package within the 
R statistical language [45] . The package and documentation can be installed 
from the CRAN archive accessible from http://www.R-project.org. Addi- 
tional description and implementational details are available in [34]. For a 
MATLAB implementation, see [6]. 

2. Mixture priors and details of calculations. In this section we discuss 
general aspects of the priors used in our procedure, and then review some 
theory for the single sequence case. Throughout, we use c to denote generic 
strictly positive constants, not necessarily the same at each use, even within 
a single equation. When there is no confusion about the value of the prior 
weight w, it may be suppressed in our notation. We write $ for the standard 
normal cumulative, and set $ = 1 — It is assumed throughout that the 
model and the observed data are renormalized so that the noise variance 

4 = 1. 
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2.1. Priors with heavy tails. Particular heavy-tailed densities that we 
shall consider for the nonzero part of the prior distribution are the Laplace 
density with scale parameter a > 0, 

7a{u) = ^aexp{-a\u\), 

and the mixture density given by 

(12) ^|e = i?~iV(0,T9"^ -1) with G~Beta(i,l). 
More explicitly, the latter density for fi has 

(13) jiu) = {2nr'/'{l-\um\u\)/^{u)} 

and has tails that decay as u~'^, the same weight as those of the Cauchy 
distribution. For this reason we refer to the density (13) as the quasi- Cauchy 
density. 

We shall mostly consider functions 7 that satisfy the following conditions: 

1. The function 7 is a symmetric unimodal density satisfying the condition 

(14) sup 

2. The quantity u'^'y{u) is bounded over all u. 

3. For some k G [1,2], 

roo 

y^'^iiy)'^ i{u)du 

is bounded above and below away from zero for sufficiently large y. 

The first of these conditions implies that the tails of 7 are exponential or 
heavier, while the second rules out tail behavior heavier than Cauchy. The 
third condition is a mild regularity condition. The conditions are satisfied if 
7 is the Laplace or quasi-Cauchy function, but not if 7 is a normal density. 

For the normal, Laplace and quasi-Cauchy priors, the posterior distri- 
bution of /X, given an observed Z, and the marginal distribution of Z are 
tractable, so that the choice of w by marginal maximum likelihood, and the 
estimation of /i by posterior mean or median, can be performed in practice, 
as outlined in the following paragraphs. We begin by setting out generic 
calculations for the relevant quantities, and then give specific details for 
particular priors. 

2.2. Generic calculations. 



-log7(u) 



< 00. 
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Posterior mean. In general, the posterior probability if post (^) = PifJ- / 0|Z = z) 
will satisfy 

(15) Wpostiz) = wg{z)/{wg{z) + (1 - w)ip{z)}. 

Define 

/i(/i|Z = z) = /(/i|Z = z,^/0), 
so that the posterior density 

fpostilAZ = Z) = {1- U;post)5o(/U) + WpostflilAz)- 

Let iJ,i{z) be the mean of the density fi{-\z). The posterior mean fl(z;w) is 
then equal to Wpost{z)fii{z). 

Posterior median. To find the posterior median w) of /i, given Z = z, 



let 



oo 



Fi{i^i\z) = / fi{u\z)du. 



If z > 0, we can find jl{z,w) from the properties 

fi{z]w)=^ \i Wpos\.{z)Fi{{)\z) <\, 

(16) 

Fi{jl{z\w)\z) = {2wpost{z)} ^ otherwise. 

Note that if Wpost{z) < \i then the median is necessarily zero, and it is 
unnecessary to evaluate -Fi(0|z). If z < 0, we use the antisymmetry property 
fi{-z,w) = -fi{z,w). 

Marginal maximum likelihood weight. The explicit expression for the 
function g facilitates the computation of the maximum marginal likelihood 
weight in the single sequence case. Define the score function S{w) =i'{w), 
and define 

(17) /5(-'-) = 7r-#ff^- 

^ ^ /-V ' ^ (^l-uj)ip{z) +wg{z) 

Then (3{z,w) is a decreasing function of w for each z, and 



(18) S{w) = J2(3iZ^, 



W) 



i=l 



Letting Wn be the weight that satisfies t{wn) = \/2Togn, the estimated 
weight vj maximizes i{w) over w in the range [tt;„,l]. It follows that, if S 
has a zero in this range, then S{w) = 0. Furthermore, the smoothness and 
monotonicity of S{w) make it possible to find w hy a binary search, or an 
even faster algorithm. The restriction on the range of w implies that the 
threshold t{w) < \/21ogn. 
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Shrinkage rules. The posterior median and mean are examples of esti- 
mation rules that yield an estimate of given Z = z. In general, a family 
of estimation rules r]{z,t), defined for all z and for t > 0, will be called a 
thresholding rule if and only if, for all t > 0, iliz^t) is an antisymmetric and 
increasing function of z on (— oo, oo) and r][z, t) = if and only if |2;| < t. It 
will have the hounded shrinkage property if and only if 

(19) z-{t + hQ)< i]{z, t)<z for all z>t 

for some constant independent of t. 

An immediate consequence of (19) is that | z — ri(z, t)\ < t + 6o for all z and 
t. For any given weight w, the posterior median will be a thresholding rule, 
with a threshold we denote by t{w), and will have the bounded shrinkage 
property under condition (14). More general thresholding rules may have 
advantages in some cases. For example, the hard thresholding rule, with a 
suitably estimated threshold, may have computational advantages and may 
preserve peak heights better, but we have not investigated this aspect in 
detail. Indeed, the choice of shrinkage rule and the choice of threshold are 
somewhat separate issues. The former is problem dependent and this paper 
is devoted to the latter. 

The posterior mean is not a thresholding rule, but has sufficient properties 
in common with the posterior median to allow similar theoretical results to 
be obtained, but under restrictions on the risk functions considered. 

2.3. Calculations for specific priors. The calculations set out above show 
that the key quantities are the marginal density g, the mean function ni{z) 
and the tail conditional probability function Fi. If 7 is the iV(0,r^) density, 
then g will be the iV(0, 1 + r^) density, and /Ui(z) = Ax, where A = r^/ (1 + 
r^). The function Fi{fi\x) will be the upper tail probability of the A^(Ax,A) 
density. 

For the Laplace distribution prior, we have 



which is a weighted sum of truncated normal distributions. Hence, it can be 
shown that, for z > 0, 



g{z) = ^aexp(ia2){e""^^>(z - a) + e"^l>(z + a)} 



and 



(20) 




if < 0, 
if /X > 0, 



(21) 



Hi{z) = z 



a{e-«^$(z - a) - e"^^>(z + a)} 
g-a2^(^ - a) + e'^^$(z + a) 
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For > 0, under the Laplace prior, we have 



_ e "■''^{iJ. - z + a) 



e-"^$(z - a) + e«^$(z + a) 
For the quasi-Cauchy distribution, we have 



5(.) = (2vr)-V2,-2^_,-.V2) 



and 



-1 



After some manipulation. 



For the Laplace prior, the equation Fi{fi{z;w)\z) = {2ii;post}~"'^ in (16) can 
be solved explicitly for jl{z;w), making use of the function In the case 
of the quasi-Cauchy prior, the equation has to be solved numerically. 

3. Some simulation results. A simulation study was carried out for the 
regression models that are by now standard in the consideration of wavelet 
methods and are given in [18]. Simulations from each of the four models were 
carried out, for each of two noise levels. For "high noise," the ratio of the 
standard deviation of the noise to the standard deviation of the signal values 
is I . In the "low noise" case the ratio is y . This complements the simulations 
for the single sequence case reported in [33]. The S-PLUS code used to 
carry out the simulations is available from the authors' web sites, enabling 
the reader both to verify the results and to conduct further experiments if 
desired. 

3.1. Results for the translation-invariant wavelet transform. In Table 1 
various wavelet methods, all making use of the translation- invariant wavelet 
transform, are compared. For each model and noise level, 100 replications 
were generated. In each replication, the function was simulated at 1024 
equally spaced points tj. The same normal noise variables were used for 
each of the models and noise levels. The error reported for each method 
considered is 



where is the noise variance in each case, and this explains why the results 
for "low noise" are apparently inferior to those for "high noise." The default 
choices of wavelet, boundary corrections and so on, given in the S-PLUS 
Wavelets function waveshrink, were used. For each realization, the noise 



1024 




T.{nti)-f{u)? 



i=l 
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variance is estimated using the median absolute deviations of the wavelet 
coefficients at the highest level. The default choice of boundary treatment 
is to use periodic boundary conditions, and such boundary conditions have 
to be used for current implementations of the translation-invariant wavelet 
transform. Detailed consideration of the use of the idea of the translation- 
invariant transform, in combination with boundary correction, is an inter- 
esting idea for future research. 

For the Laplace prior 7, with both w and the scale parameter a estimated 
level-by-level by marginal maximum likelihood from the data, estimates were 
constructed using both the posterior median and the posterior mean. For the 
quasi-Cauchy prior, estimates using the posterior median were calculated. 
The posterior median for the mixed Gaussian prior was also calculated; as 
for the Laplace prior, both w and the scale parameter were estimated from 
the data. 

Three other methods based on the translation-invariant wavelet transform 
were considered: SURE applied to 4 and 6 levels of the transform, universal 
soft thresholding applied to 6 levels of the transform, and the false discovery 
rate approach with various values of the parameter q. Whenever the false 
discovery approach is used in the wavelet context, the method is applied 
separately at each level, a method derived from [2]. The same parameter q 

Table 1 

Average over 100 replications of summed squared errors over 1024 points for various 
models and methods. All the wavelet-based estimators use the translation-invariant 
wavelet transform. The standard error of each of the entries is at most 2% of the value 

reported 



Method 




High 


noise 






Low 


noise 




bmp 


blk 


dop 


hea 


bmp 


blk 


dop 


hea 


Laplace (median) 


171 


176 


93 


41 


212 


164 


109 


57 


Quasi-Cauchy (median) 


177 


185 


97 


40 


221 


169 


115 


56 


Gaussian (median) 


223 


178 


108 


42 


296 


247 


150 


65 


Laplace (mean) 


181 


182 


100 


45 


214 


175 


115 


62 


SURE (4 levels) 


243 


205 


140 


73 


299 


255 


181 


95 


SURE (6 levels) 


237 


199 


123 


45 


296 


252 


167 


71 


Univ soft (6 levels) 


701 


417 


229 


67 


997 


749 


386 


110 


FDR (q = 0.01) 


170 


198 


97 


43 


223 


164 


109 


56 


FDR (q = 0.05) 


169 


173 


93 


39 


223 


163 


110 


53 


FDR (g = 0.1) 


177 


168 


93 


39 


235 


174 


116 


53 


FDR [q = 0.4) 


264 


212 


127 


50 


353 


273 


181 


72 


Spline 


1294 


433 


265 


51 


6417 


1826 


905 


117 


Tukey 


545 


330 


286 


246 


1892 


655 


425 


257 
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is used at each level, but the resulting estimated threshold may, of course, 
vary. 

Comparisons are also included with two standard nonwavelet methods: cu- 
bic smoothing splines using GCV (smooth. spline in S-PLUS) and Tukey's 
4(3RSR)2H method, running medians with twicing, the default S-PLUS 
smooth. 

The standard error of each of the entries in the table is at most 2% of the 
value reported, so the values are correct to about 2 significant figures. The 
two standard nonwavelet methods both perform badly. Not surprisingly, 
given that it is specifically designed for smooth functions, the smoothing 
spline method fails disastrously on discontinuous and spiky signals. Neither 
method is good at separating signal from noise in the low noise case. The 
Tukey method is, to some extent, competitive with universal thresholding for 
the more inhomogeneous signals, but cannot adapt to the smoother behavior 
of the HeaviSine signal. 

As for the methods based on the wavelet transform, the performance of the 
posterior mean estimator with the Laplace prior is consistently slightly worse 
than that of the posterior median. The universal thresholding method does 
not compare well, and SURE also gives noticeably worse performance than 
the Laplace and quasi-Cauchy empirical Bayes methods. The FDR method 
is competitive, provided the parameter is chosen appropriately. For these 
signals and sample size, q = 0.05 and 0.1 give good performance, but the 
performance is worse in some cases if q = 0.01 and considerably worse if q = 
0.4. We shall see in subsequent examples that the choice of this parameter is 
crucial to the performance of the FDR method, and that, in other situations, 
the relative performance of the FDR method is, in any case, not quite as 
good. 

Within the translation-invariant wavelet transform, the observed coeffi- 
cients are not independent. Benjamini and Yekutieli [8] propose a modi- 
fication to the FDR method to take account of dependence between ob- 
servations, replacing q by q/J2h=ik~^^ where M is the number of param- 
eters under consideration. In the translation-invariant wavelet transform, 
the number of coefficients at each level is equal to the number of original 
observations, 1024 in the simulation example considered, so the correction 
factor is X^i''^^ ^ Therefore, the results reported for q = 0.05 would 
correspond to q = 0.05 x 7.5 = 0.375 within the Benjamini- Yekutieli proce- 
dure. Since we are choosing the q parameter arbitrarily in any case, this 
recalibration of the q parameter does not affect our general conclusions. 
However, it does mean that the precise numerical value of g = 0.05 in the 
translation-invariant case cannot necessarily be translated directly to the 
standard discrete wavelet transform. 

The mixed Gaussian prior model does not fit the theoretical assumptions 
of this paper and it can be seen that its performance is not as good as the 
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Table 2 

Difference in summed square errors between the methods indicated and the '^Laplace 
(median/' method, measured in terms of the standard error of the difference estimated 

from 100 replications 



Method 




High 


noise 






Low 


noise 




bmp 


blk 


dop 


hea 


bmp 


blk 


dop 


hea 


Quasi-Cauchy (median) 


14 


16 


10 


-2.9 


16 


12 


15 


-2.6 


Laplace (mean) 


15 


6 


9 


9 


2.7 


16 


11 


9 


SURE (6 levels) 


49 


19 


26 


6 


45 


60 


46 


16 


Univ soft (6 levels) 


101 


124 


81 


35 


100 


97 


77 


74 


FDR [q = 0.01) 


-1.6 


17 


5 


5 


13 


-1.0 


0.8 


-0.9 


FDR (g = 0.05) 


-3.0 


-2.4 


-1.1 


-5 


12 


-1.7 


2.7 


-9 


FDR (g = 0.1) 


6 


-6 


-0.3 


-6 


15 


8 


10 


-9 


FDR {q = 0.4) 


27 


12 


15 


7 


36 


30 


27 


7 



heavy-tailed priors. It is clear that the tail requirements on 7 have some 
bearing on the performance of the empirical Bayes approach. More detailed 
investigation of this issue would be an interesting topic for further research. 

Because the same noise values are used for each model, there is correla- 
tion between the various values in Table 1. Comparisons of methods with 
the Laplace (median) method on a paired-sample basis are given in Ta- 
ble 2. It can be seen that the empirical Bayes method with the Laplace 
prior using the posterior median decisively outperfoms the other methods, 
except for the HeaviSine function, where the quasi-Cauchy prior performs 
very slightly better, but there is little to choose between the Laplace and 
quasi-Cauchy priors. Of the four FDR methods, the inferior performance 
for q = 0.01 and 0.4 is significant. For q = 0.05 and 0.1, the results are 
more equivocal, but the cases for which the FDR method underperforms 
are the ones with the most significant difference. Some further comparisons 
between these FDR methods and the empirical Bayes methods will be made 
below. 

3.2. Results for the standard discrete wavelet transform. In order to eval- 
uate the advantage of the translation-invariant transform, the same simu- 
lated data were also smoothed using methods based on the standard trans- 
form. The results are shown in Table 3. Additional comparisons are in- 
cluded, with the two block thresholding methods considered by Cai and 
Silverman [10], and with the QL method of Efromovich [23]. The block 
thresholding methods choose thresholds by reference to information from 
neighboring coefficients within the transform. In the case of NeighCoeff, 
only the two neighboring coefficients are used when considering a particular 
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Table 3 

Average over 100 replications of summed squared errors over 1024 points for various 

models and methods. In each case a standard wavelet transform was used. The two 
nonwavelet methods are not included, because they give the same results as in Table 1. 
For comparison, the results for the Laplace prior using the translation-invariant 
transform are repeated from Table 1, in italics 



Method 












Low 






bmp 


blk 


dop 


hea 


bmp 


blk 


dop 


hea 


Laplace (median) 


















translation-invariant 


171 


176 


93 


41 


212 


164 


109 


57 


Laplace (median) 


278 


245 


147 


53 


338 


311 


204 


76 


Quasi-Cauchy (median) 


277 


252 


150 


54 


324 


301 


200 


73 


Gaussian (median) 


328 


252 


158 


56 


400 


361 


241 


87 


Laplace (mean) 


257 


228 


140 


57 


304 


278 


190 


79 


NeighBlock 


462 


406 


148 


67 


436 


485 


207 


125 


NeighCoeff 


324 


320 


145 


60 


316 


345 


207 


91 


QL 


359 


310 


175 


58 


411 


366 


243 


82 


SURE (4 levels) 


317 


248 


183 


97 


393 


331 


247 


117 


SURE (6 levels) 


312 


247 


167 


69 


399 


339 


235 


94 


Univ soft (6 levels) 


937 


484 


277 


76 


1444 


931 


534 


121 


FDR (q = 0.01) 


331 


307 


169 


60 


387 


382 


231 


83 


FDR (q = 0.05) 


299 


278 


163 


57 


347 


334 


216 


78 


FDR (g = 0.1) 


301 


271 


162 


60 


356 


330 


221 


81 


FDR [q = 0.4) 


395 


333 


221 


97 


477 


420 


310 


130 



coefficient, wliile, for NeigliBlock the data are processed in blocks and infor- 
mation is drawn from neighboring blocks. At coarse scales the QL method 
uses a thresholding rule with threshold equal to the standard deviation of 
the coefficients, while at finer levels the coefficients are thresholded at a 
threshold that increases up to the universal threshold as the level increases, 
but at the same time the proportion of coefficients allowed to be nonzero is 
also controlled, more stringently the higher the level. 

Several interesting conclusions can be drawn from this table. In this case, 
the posterior mean generally yields superior estimates to the posterior me- 
dian. The NeighCoeff method is the better of the two block thresholding 
methods, but generally underperforms the Laplace prior /posterior mean 
method. The QL method performs well for the HeaviSine signal, but for 
the others is not so competitive. In this context, the relative performance of 
the FDR method is not as good as previously, but the importance of choosing 
the parameter q appropriately remains. In general, it is clear how important 
is the use of a translation-invariant transform. The empirical Bayes method 
with a Gaussian prior was also tried in this context, and the results were, 
again, somewhat inferior to those for the heavy-tailed priors. 
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We can use Tables 1 and 3 to give another measure of performance. Let 
denote the value in cell (j, k) of the table, the error measure of method 
j applied in case k. Then define the overall performance of method j by 
R{j) = minfc(min£ r^fc/r^fc). The ratio min^ r^fc/r^^ quantifies the relative per- 
formance of method j on case /c, by comparing it with the best method for 
that case. The minimum efficiency score R{j) then gives the loss of efficiency 
of estimator j on the most challenging case. For the translation-invariant 
transform, the Laplace (median) case has a minimum efficiency score of 
93%, while the FDR method with q = 0.05 scores 95%. The quasi-Cauchy 
method scores 91% and the FDR with q = 0.1 scores 90%. 

However, if we turn to the standard transform, the results are more de- 
cisive, with scores of about 90% for the empirical Bayes Laplace and quasi- 
Cauchy median methods, but only 82% for the FDR with q = 0.05 and 84% 
for FDR with g' = 0.1. It should be noted that the scores of around 90% for 
the empirical Bayes methods are only because the empirical Bayes method 
that is very best varies slightly between cases considered. But to be spe- 
cific, the Laplace (median) method consistently outperforms all the FDR 
methods. 

4. Comparisons on illustrative data sets. Li this section the simulations 
are complemented by the consideration of three illustrative examples drawn 
from practical applications. Taking account of both the simulations and the 
practical comparisons, the empirical Bayes method, using the Laplace prior 
and the posterior median estimate, is fully automatic and, on each of the 
simulation studies considered as a whole, and on the practical illustrations, 
performs either best or nearly as well as the best method in each setting. 
The FDR method with q = 0.05 is slightly superior on the first simulation 
study, but at the expense of more substantial underperformance otherwise, 
at least on the cases we have considered. 

4.1. Inductance plethysmography data. Our first practical comparison 
uses the inductance plethysmography data described in [39]. The data were 
collected by the Department of Anaesthesia, Bristol University, in an inves- 
tigation of the breathing of patients after general anaesthesia. For further 
details, and the data themselves, see the help page for the ipd data in the 
WaveThresh package [40]. 

Plots of the original data and the curve estimate obtained using the 
Laplace prior method are shown in Figure 1. The results for the Laplace 
and quasi-Cauchy priors are virtually identical, so only the Laplace results 
are reported in detail here. The aim of adaptive smoothing with data of this 
kind is to preserve features such as peak heights as far as possible, while 
eliminating spurious rapid variation elsewhere. Abramovich, Sapatinas and 
Silverman [4] found that their BayesThresh method performed better in this 
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Fig. 1. Top panel: the inductance plethysmography data. Bottom panel: the effect of 
smoothing the inductance plethysmography data with the Laplace prior method. 

regard than various other wavelet methods, but that for best results a subjec- 
tive adjustment of their parameter a from a = 0.5 to a = 2 gave preferable 
results. The MML approach gave virtually the same results whether the 
quasi-Cauchy or Laplace prior is used. 

The efficacy of the various methods in preserving peak heights is most 
simply judged by the maximum of the various estimates, the height of the 
first peak in the curve. The standard BayesThresh method (a = 0.5) yields 
a maximum of 0.836, while subjectively adjusting to a = 2 gives 0.845. The 
empirical Bayes method gives 0.842. Overall, the empirical Bayes method 
gives results much closer to the adjusted BayesThresh; the maximum dis- 
tance from the empirical Bayes curves to the adjusted BayesThresh curve is 
about one-third that from the original BayesThresh estimate. The efficacy 
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of the various methods in deahng with the rapid variation near time 1300 
can be best quantified by the range of the estimated functions over a small 
interval near this point. The standard BayesThresh method has a "glitch" 
of range 0.08, while, for both the adjusted BayesThresh and the empirical 
Bayes method, the corresponding figure is under 0.06, a substantial if not 
dramatic improvement. 

The FDR method with various parameters was also applied. As in the 
simulations, the FDR approach is applied separately to each level, with the 
same parameter q at each level. For all the FDR q parameters considered, 
the maximum of the estimated curve is between 0.842 and 0.843, but the 
range of the estimated curve near time 1300 is around 0.075. Thus, FDR 
competes well with empirical Bayes on preserving the peak height, but at 
the cost of inferior treatment of presumably spurious variation elsewhere. 

Another comparison between the various methods can be made by con- 
sidering the threshold that they use at various levels of the transform. 
The threshold is not a full description of the procedure, especially in the 
BayesThresh and Laplace prior cases where there are two parameters in the 
prior, but the threshold is a useful univariate summary of a method of pro- 
cessing wavelet coefficients. Figure 2 gives the comparison for the various 
methods applied to these data. It can be seen that, at the top four levels, 
the empirical Bayes methods track the adjusted BayesThresh method quite 
closely. The standard BayesThresh uses very high thresholds, which may be 
the reason why it smooths out the peak height somewhat. At the coarser 
levels, the empirical Bayes methods automatically adjust to much lower 
thresholds, reflecting a way in which the signal is less sparse at these levels, 
and thus allowing variation at these scales to go through quite closely to the 
way it is observed. None of the FDR parameter choices gives the degree of 
adaptivity of threshold to level shown by the empirical Bayes methods. 

To conclude the comparison between BayesThresh and the empirical Bayes 
method, the subjectively adjusted BayesThresh method already yielded very 
good results for these data, but the basic message of this discussion is that 
the empirical Bayes method yields results virtually as good as those of the 
best BayesThresh method, but without any need for subjective tinkering 
with the parameters. In addition, the use of maximum likelihood to esti- 
mate the prior parameters is a less ad hoc approach than the fitting method 
used by the BayesThresh approach. 

4.2. Ion channel data. A comparison between empirical Bayes and SURE 
is provided by considering a segment of the ion channel data discussed, for 
example, by Johnstone and Silverman [30]. Because these are constructed 
data, the "true" signal is known. See Figure 3. The thresholds chosen by 
SURE (dashed line) are reasonable at the coarse scales 6, 7 and 8, but are 
too small at the fine scales 9 to 11 where the signal is sparse, show some 
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Fig. 2. Thresholds chosen for the top six levels of the wavelet transform of the induc- 
tance plethysmography data by various methods. Upper figure: e: empirical Bayes, Laplace 
prior; c: empirical Bayes, quasi-Cauchy prior; b: BayesThresh; t: BayesThresh, subjec- 
tively tinkered, with a = 2. Lower figure: False Discovery Rate method with parameters 
5 = 0.01,0.05,0.1 and 0.4. 



instability in the way they vary, and lead to insufficient noise removal in the 
reconstruction. By contrast, the empirical Bayes threshold choices increase 
monotonically with scale in a reasonable manner. In particular, the univer- 
sal thresholds at levels 9 to 11 are found automatically. Two reconstructions 
using the same EB thresholds are shown in panel (b): one using the posterior 
median shrinkage rule, and the other using the hard thresholding rule. The 
hard threshold choice tracks the true signal better. The choice of threshold 
shrinkage rule is problem dependent, and beyond the scope of this paper. It 
is somewhat separate from the issue of setting threshold values. 
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Fig. 3. Left panel: Estimated threshold t plotted against level j; dashed line: SURE 
thresholds, solid line: EB thresholds. Right panel: Segment of the ion channel signal and 
three estimates. Both solid lines use EB-thresholds, but one uses a hard thresholding rule 
and tracks the true signal better, while the other uses posterior median shrinkage. The 
result of using SURE thresholds is plotted as the dashed line, and the dotted line gives the 
true signal. 
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Table 4 

Percentage of errors in estimation of ion channel gating signal. The 
errors are the average over ten separate sequences of length 4096 

drawn from the data provided by Eisenberg and Levis. The variances 
of the wavelet coefficients at each level were estimated separately 



Decimated? 


N 


Y 


Laplace (median) 


2.4 


3.0 


Quasi-Cauchy (median) 


2.7 


3.5 


Laplace (mean) 


2.3 


2.6 


SURE (4 levels) 


2.2 


3.1 


SURE (6 levels) 


2.3 


3.2 


Univ soft (6 levels) 


6.0 


7.5 


FDR (g = 0.01) 


3.1 


4.4 


FDR (g = 0.05) 


2.8 


3.9 


FDR (g = 0.1) 


2.6 


3.7 


FDR (g = 0.4) 


2.3 


3.6 



Spline 4.4 

Tukey 11 

AWS 6.2 

Special 2.0 



A systematic quantitative comparison is given in Table 4. For each method 
considered, ten sequences of length 4096 drawn from the original data were 
analyzed. The variances of the wavelet transform at the various levels were 
estimated by separate consideration, imitating the effect of using a sequence 
of observations with no signal to calibrate the method. For each method, 
the curve estimated by the smoothing method was then rounded off to the 
nearest of zero and one to give the final estimate. The figures given are the 
average percentage error over the ten sequences considered. 

As an aside, we note that our theoretical results, of course, do not specif- 
ically include this zero-one loss of the estimate rounded to the nearer of 
zero or one. However, we do consider Lq losses for q near zero, which catch 
something of the flavor of discrete losses, in view of the fact that the limit 
as g — > of the qlh. power of the Lq norm is a zero-one loss. 

Comparisons were made with the special-purpose method developed specif- 
ically for this problem by the originators of the data, and with standard 
smoothing methods, including the AWS method of Polzehl and Spokoiny 
[43]. The special-purpose method achieves an error rate of 2.0%; because of 
the specificity of this method, it is perhaps not surprising that it cannot be 
beaten by the more general-purpose methods we consider, but some of the 
translation-invariant wavelet methods come close. In this case the posterior 
mean slightly outperforms the posterior median, and other good methods 
are SURE and FDR with q = 0.4. If we use the parameter values q = 0.05 
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and 0.1 appropriate in our main simulation, then the results are inferior, 
underlining the need to tune the FDR parameter to the problem at hand. 

4.3. An image example. Turning finally and briefly to images, Figure 4 
shows the effect of applying empirical Bayes thresholds to a standard image 
with Gaussian noise added. The thresholds are estimated separately in each 
channel in each level. Nine realizations were generated, and the signal to 
noise ratio of the estimates (SNR = 201ogio(||/ — / II2/II/II2)) calculated for 
both thresholding at 3a e and for the empirical Bayes thresholds. Smaller 
SNR corresponds to poorer estimation, though, of course, this quantita- 
tive measure does not necessarily correspond to visual perception of relative 
quality. The actual images shown correspond to the median of the nine ex- 
amples, ordered by the increase in SNR between the 3a e threshold approach 
and the empirical Bayes approach. 

For the example shown, the EB thresholds are displayed in the table 
below. They increase monotonically as the scale becomes finer and yield 
SNR = 33.83. They are somewhat smaller in the vertical channel, as the 
signal is stronger there in the peppers image. Fixing the threshold at 3aE in 
all channels leads to small noise artifacts at fine scales (SNR = 33.74), while 
fixing the threshold at aE^J"^ logn (not shown) leads to a marked increase 
in squared error (i.e., reduced SNR). 



Channel/Level 


3 


4 


5 


6 


7 


Horizontal 





1.1 


2.3 


3.2 


4.4 


Vertical 








2.0 


3.0 


4.4 


Diagonal 





1.7 


2.7 


4.1 


4.4 



5. Theoretical results. We now turn to the theoretical investigation of 
the proposed empirical Bayes method for curve estimation using wavelets. 
In doing so we distinguish between various different models for observed 
wavelet coefficients and for the theoretical coefficients of interest. Suppose 
throughout that level J is such that the sum of the lengths of all the coeffi- 
cient vectors below level J is equal to N . 

5.1. Models for the observed data. In the white noise model, it is assumed 
that we have independent observations Yjk ~ N{9jk, N~^) of the wavelet co- 
efficients 6jk themselves. Because of the orthonormality properties of the 
wavelet decomposition, observations of this kind would be obtained by car- 
rying out a wavelet decomposition of the function f{t) + N~^^'^ dW{t), where 
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Fig. 4. Translation invariant hard thresholding applied to a noisy version of the '^pep- 
pers" image. For original image and noisy version see, for example, [36], Figure 10.6. 
Left panel; fixed threshold at Sas- Right panel: Level and channel dependent EB thresh- 
olds as shown in the table. The image obtained by fixed thresholds contains spurious high 
frequency effects that are largely obscured by the printing process. For a clearer compar- 
ison, the reader is recommended to view the images in the online version available from 
the authors^ web sites. 

dW{t) is a white noise process. In our main theory, we only use the Yjk at 
levels j < J, setting coefficients at higher levels to be zero. 

The other model of practical relevance is the sampled data model, where 
we assume that we have data Xi = f{i/N) + ej, where are independent 
A^(0, 1) random variables. Let 9 be the discrete wavelet transform of the 
sequence N~^/'^f{ti), and Y that of the sequence N^'^/'^X, so that ^^fc ~ 
N[9jk, N"^). In much of the current statistical literature, the distinction 
between the white noise coefficients Y^-fc and the sampled-data coefficients 
l^-fc is often glossed over, as is that between the function coefficients 9 and 
the time-sampled coefficients 9. The theoretical framework within which we 
work is, generally, to assume that 

(22) J2 \^jkf < 0^2^"^^ for ah j, 

k 

corresponding to membership of the underlying function / is a particular 
smoothness class. The first case we shall consider is where we observe Y and 
estimate 9. 

The other cases all make use of the sampled-data coefficients Y. If we 
retain the constraint (22) on the underlying function, we can show that, pro- 
vided the wavelet basis is chosen appropriately, the discretization involved in 
the sampled-data construction does not affect the order of magnitude of the 
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accuracy of the estimates. This is the case whether we consider the estimates 
9{Y) of the coefficients to be estimates of the wavelet coefficients 9 of the 
function itself, or use the estimated coefficients to reconstruct an estimate of 
the sequence f(i/N) via the discrete wavelet transform 9. Unless we impose 
periodic boundary conditions, a key prerequisite for the consideration of the 
sampled data model is the development of appropriate boundary-corrected 
bases with corresponding preconditioning of the data near the boundaries, 
and we consider this aspect below. 

A final model is the situation where it is the sequence of values f{i/N) 
that is of primary interest, but we place the Besov array bounds on the 
discrete wavelet transform 9 of this sequence rather than on the underlying 
function. We replace (22) by the constraint 

(23) I^Jfel^ ^ CP2~''P^ for all j < J. 

k 

In this case we only require orthonormality of the discrete wavelet transform, 
but the condition (23) depends both on the function / and on the particular 
under consideration. The asymptotic theorem should be thought of as 
a "triangular array" result, rather than a limiting result for a particular 
function /. The formalism of the proof is identical to the white noise case, 
except there is no need to consider terms for j > J and this eliminates one 
of the error terms in the result. 

5.2. Array results under Besov body constraints. Suppose that 9jk is a 
coefficient array, defined for j = L — 1, L, L + 1, . . . and < k < Kj , for 
Kj satisfying 2^"^ < Kj < 2^ for j>L and 2^"^ < Kl-i < 2^. Let N = 
J2L-i<j<j -^j integers J, and consider limits as J— > oo. For given J, as- 
sume we have observations Yjk ~ N{9jk, N~^a'^) for j = L — 1, L, . . . , J — 1, 
< k < Kj . The variance a'^ is assumed to be fixed and known, and without 
loss of generality we set a\ = l. 

Let 9j denote the vector {9jk : < k < Kj) and define Yj similarly. The 
vector 9l-i is estimated by Il-i- For L < j < J, each vector 9j is estimated 
separately by the empirical Bayes method described above. Set fi = N^/'^9j 
and Z = N^^'^Yj, and obtain an estimate of fi using a possibly modified 
threshold with parameter A>0.1iA = 0, then the threshold is not modified, 
while if A> 0, the threshold is as defined in (7). The threshold is that cor- 
responding to the posterior median function, but provided this value of the 
threshold is used, the estimation can be carried out by any thresholding rule 
satisfying the bounded shrinkage property (19). We then set 9j = N~^^'^ fi. 
For j > J, finer scales than the observations assumed available, we set 9j = 0. 

The overall risk is defined to be 

oo 

(24) Rn,,A0) = EWOl^i - 9l^i\\1 + ^ r'i^E\\9, - 9j\\l 
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Under suitable conditions on the wavelet family, this norm dominates a 
g-norm on the ath derivative of the original function ifs = cr+^ — |; see 
Section 5.6. The constant by which the contribution of the scaling coefficients 
is multiplied is somewhat arbitrary, and may be altered without affecting 
the overall method or results. We can now state the main result, which 
demonstrates that the empirical Bayes method attains the optimal rate of 
convergence of the mean gth-power error for all values of q and p down to 
0. 

The result also yields smoothness properties of the posterior estimate. It 
demonstrates, for values of a and q satisfying the conditions of the theorem, 
that the coefficient array 6 has finite bg ^ norm and, hence, under suitable 
conditions on the wavelet has crth derivative bounded in g-norm. 

Theorem 1. Assume that < p < oo and < q < 2, and that a > |. 
Suppose that the coefficient array 9 falls in a sequence Besov hall hp^^{C) 
so that 

(25) ll^jllp < C2-°^' for all j, 
where a = a + ^ — |>|. Let -5 = o" + ^ — | and set 

r = [a — a) / {2a + 1) and r' = {a — s)/2a. 

Assume that a > and that a — a> max(0, - — -). Assume also that sq < A. 

Then, for some quantity c which does not depend on C or N (but may 
depend on a,p,a,q, as well as 7, A and the wavelet family), the overall 
q-norm risk satisfies 

(26) RN,gA^) < c{A(C, N) + cm-'-"'' + log'' N}, 
where 

{fj(l-2r)qjy-rq ^ 
C(l-2r')g^-Aiogr'g+l^^ 

r" = a-a - (- --)+ and < u < 4. 

Remarks. If q <p, then necessarily ap > sq since a > s. However, if 
q> p, then the three cases in (27) correspond, respectively, to the "regular," 
"critical" and "logarithmic" zones described in [22]. 

Note first that, by elementary manipulations, 

, ap — sq 

r — r = -, 

apq{2a + 1) 

so the cases in (27) could equally be specified in terms of the relative values of 
r' and r. Also, a — s = a — a— - + ->0, so r'>0 and r" = min{a — a,a — s}. 



if ap > sq, 
if ap = sq, 
if ap < sq. 
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The condition sq< A will be satisfied for all q in (0, 2] if ^ > 2a. A particular 
situation in which this will hold is the "standard" case ^ = and a = 0. 

The rates in (27) agree with the lower bounds to the minimax rates derived 
in Theorem 1 of [22], and so the first term of (26) is a constant multiple of the 
minimax dependence of the risk on the number of observations subject to 
the Besov body constraints. For fixed C the other terms are of smaller order. 
The same rates arise in [17], which demonstrates that suitable estimators, 
dependent on a, attain these rates for q = 2. 

First consider the term. Using the conditions a> ^ and a> a>0, 

we have a — s = 2ar' > r' and a — a = (2a + l)r > r, so that r" > min{r, r'}. 
If a > ^, then the inequality is strict and the A^"** term will be of lower 
polynomial order than A(C, N) in every case. If o = ^ and r' < r, we will 
have r" = r', but, for fixed C, A(C, N) will still dominate because of the 
logarithmic factor. 

Since r < the N^'^^'^log'^ N term will always be of smaller order than 
A(C, N). This term shows that, even if the Besov space constant C is allowed 
to decrease as N increases, or is zero, we have not shown that the risk can 
be reduced below a term of size N~'^^'^, with an additional logarithmic term 
in certain cases. The exact definition of is 

r 0, if sq < A, 

(28) i^=l{q+l)/2, ifsq = A>0, 

I 3+ (g-pA2)/2, ifsq = A = 0. 

Truncating risk at fine scales. Consider the estimation of 6 from the 
transform Y, subject to the discretized constraints (23). In this case there 
is no need to consider levels j > J in the risk, and the condition a> ^ , 
equivalent to a > 1/p, can be relaxed to a > 0, equivalent to o > | — ^. 
Define 

J-i 

(29) RN,,Af) = E\\Ol^i{Y) - h^iW', + E 'i'''E\\¥y) - 
We then have the simpler result 

(30) RN,,Af) < 4MC, N) + AT--^/^ log-^ N}. 

Define f{i/N) to be the sequence obtained by the inverse discrete wavelet 
transform applied to N^/'^6{Y). In the "standard" case a = A = and q = 2, 
the orthogonality of the wavelet transform allows us to deduce from (30) 
that, subject to the constraint (23), 

TV 

N-'j2E{fm)-fm)r 

i=l 

< ^|^2/(2a+l)^-2a/(2a+l) ^ AT"! (log iV)4-( V2)(pA2) | ^ 
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which imphes (9). 

White noise model when fine scale observations are available. If we as- 
sume that we have data Yjk for all levels, not just for j < J, then we can 
again relax the lower bound a> ^ to a > 0. For definiteness, estimate 6j 
from the data for levels up to j = J^, and set the estimate to zero for higher 
levels. Then we will have the result 

(31) RN,qA^) < c{A(C7,iV) + C7'?exp(-c'log2iV) +iV~'?/2iog2i/^| 

for a suitable c' > 0. The second term in (31) decays faster than polynomial 
rate in A'^ for any fixed C. 

The proof of Theorem 1, together with the minor modifications required 
to prove (30) and (31), is given in Section 6.2 below. 

5.3. Wavelets whose scaling functions have vanishing moments. Turn 
now to the issue of developing theory for the sampled data case subject 
to retaining the constraints on the function / itself. Crucial to our theory 
are wavelets constructed from a scaling function (j) with vanishing moments 
of order 1, 2, . . . , i? — 1, and R continuous derivatives, for some integer R. The 
corresponding mother wavelet ■0 is orthogonal to all polynomials of degree 
i? — 1 or less, and both and '0 are supported on the interval [—S + 1, S] for 
some integer S > R. Coiflets, as discussed in Chapter 8.2 of [16], are an ex- 
ample of wavelets constructed to satisfy these properties. The zero moments 
of the scaling function are used to control the discretization error involved 
when mapping observations to scaling function coefficients at a fine scale. 
Note that many standard wavelet families have scaling functions with nearly 
vanishing moments of orders 1 and 2; an issue for future investigation is the 
tradeoff in finite samples between relaxing the condition of exactly vanishing 
moments and using wavelets of narrower support than coiflets. 

Unless we are happy to restrict attention to periodic boundary condi- 
tions, it is necessary to modify the wavelets and scaling functions near the 
boundary, and, hence, the filters used in the corresponding discrete wavelet 
transform. A construction following Section 5 of [14] can be used to perform 
this modification, while maintaining orthonormality of the basis functions. 
We review the application of the construction; for fuller details and proper- 
ties see [32]. 

Remarks. 1. If the restriction to (boundary modified) coifiets is needed 
for our theory, why is inferior behavior not observed for other Daubechies 
wavelet families in practice? In fact, it follows from [26] that if one recenters 
a Daubechies scaling function (p at its mean r = J x(j){x) dx, then the second 
moment necessarily vanishes. Thus, up to a horizontal shift r, one obtains 
two vanishing moments "for free." 
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2. The approach to sampled data taken by Donoho and Johnstone [20] 
works for a broad class of orthonormal scaling functions, by a less direct con- 
struction relating white noise and sampled data models through multiscale 
Deslauriers-Dubuc interpolation. 

The construction is based on boundary scaling functions cf)^ for k = 
—R, —R+1, . . . , R — 2, R — 1, and boundary wavelets ip^ for k = — S + 1, — S + 
2, . . . , S — 1, S — 2. The support of these functions is contained in [0, 25" — 2] 
for k>0 and in [—(25' — 2),0] for A; < 0. We fix a coarse resolution level L 
such that 2S <2^ . At every level j > L, there are 2-' — 2{S — R—1) scaling 
functions, defined by 

(t>jk{x) = 2^/^4>^{2^x), keO:{R-l), 
(t>jk{x) = 2^l^<^(2?x -k), ke{S-l): {2^ - S), 

= 2^/^^t2.{2^{x - 1)), k G (2^- - R) : {2^ - 1), 

and 2^ wavelets 

= 2JV2^f (2^x), A; GO: (5- 2), 

il^jkix) = 2^'/2^(2Jx -k), ke{S -l):{2^ - S), 

i,,u{x) = 2^-/2^f_2^,(2J-(x - 1)), G (2^- - 5 + 1) : {2^ - 1). 

All these functions are supported within [0, 1] . There are no scaling func- 
tions defined for R < k < S — 1 or for 2^ — S < k < 2^ — R, but there are no 
such gaps in the definition of the wavelets. The S — 1 wavelets at each end 
are boundary wavelets, and have the same smoothness, on [0,1], and van- 
ishing moments as the original wavelets. The 2^ — 2S interior wavelets are 
not affected by the boundary construction, and depend only on the 2"^ — 25" 
interior scaling functions at the finest scale. At the coarsest level L, there 
will be 2^ — 2(5 — R—1) scaling coefficients; denote by JCl-i the set of 
indices for which the scaling functions (pik are defined. At every level j > L, 
define K.^ to be the set of k for which il^j^ is a scaled version of a boundary 
wavelet, and /Cj to be the set of k for which ipjk is a scaled version of ^/^ 
itself. 

Given a function / on [0, 1] , we can now define the wavelet expansion of 
/by 

oo 2J-1 

(32) / = ^L-l,k4'L,k + YY1 ^jk'4'jk, 

keKL-i j=L k=0 



where 



OL-i,k = [ f{t)4'Lk{t)dt for k in )Cl- 
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and 

6i.fc= [ f{t)ipjk{t)dt forj>L and 0<k<2^. 
Jo 

Where there is a need to distinguish between the boundary and interior 
wavelet coefficients, we write 9^ for the coefficients with j > L and k G JCj 
and 9^ for the boundary coefficients, those with j > L and k € /C^. 

5.4. Constructing wavelet coefficients from discrete data. Suppose now 
that we are given a vector of observations or of values of a function. In 
order to map these to scaling function coefficients at a suitable level, it is 
necessary to construct appropriate preconditioning matrices. In this section 
we define these matrices and set out certain of their properties. For more 
details, see [32]. 

On the left boundary define the R x R matrix W and the (5 — 1) x ii 
matrix U by 

Wm= x^(t)f{x)dx, fc= l,2,...,i?;^ = 0,l,...,i?- 1, 
Jo 

Uji=/, j = l,2,...,5;£ = 0,l,...,i?-l. 

Because U is of full rank, we can define to be an i? x (S — 1) matrix such 
that A^U = W. Similarly, the matrix A^ is constructed to satisfy A^U = W, 
where 

Wkt= / xV-fc(2;)(ix, k = l,2,...,R;e = 0,l,...,R-l, 

Uji={-lYf, j = l,2,...,5;£ = 0,l,...,i?-l. 

Given a sequence Xq, Xi, . . . , Xj\[^i with N = 2^^ , define the precondi- 
tioned sequence PjX by 

iPjX), = J2AiX,, keO:{R-l), 

1=0 

{PjX)k = Xk, k€{S-l):{N-S), 

{PjX)k = A§^,^,XM-i, ke{N-R):iN- 1). 
1=1 

If the Xi are uncorrelated with variance 1, then the variance matrix of 
the first part of PjX is A^{A^y, while that of the last part, with indices 
taken in reverse order, is A^[A^y . 

There is some freedom in the choice of A^ and A^ . For example, they can 
be defined such that not quite all the original sequence is needed to evaluate 
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the preconditioned sequence. Specifically, to eliminate dependence on the 
first or last S — R — 1 values of the sequence, let Ui be the square invertible 
matrix consisting of the last R rows of U, and let = [Orx{S-R-i) '■ ^U^^], 
and correspondingly. 

If, on the other hand, we have all the values in the sequence, then we have 
more freedom to choose A^ and A^. A natural choice is A^ = WU^ and 
A^ = WU~^ , where the superscript + denotes the Moore-Penrose generalized 
inverse. These choices will minimize the traces of the matrices A^{A^y and 
A^{A^y and, hence, the sum of the variances of PjX, if we suppose that 
the Xi are uncorrelated with unit variance. 

In general, under the same assumption on X, let ca be the maximum of 
the eigenvalues of A^{A^y and A^{A^y. Let Y be the boundary-corrected 
discrete wavelet transform of the sequence N~^/'^PjX . Then the eigenvalues 
of the variance matrix of PjX will be bounded by ca- Because of the orthog- 
onality of the boundary-corrected discrete wavelet transform, the variance 
of the elements of Y is bounded by caN~~^ . 

The array Y^ of interior coefficients [Yjk : L < j < J, S - 1 < k <2^ - S) 
will only depend on Xj for S — 1 <i < N — S, in other words, those Xi left 
unchanged by the preconditioning. Therefore, Y^ will be an uncorrelated 
array of variables with variance . 

The preconditioning also makes it possible to get very good approxima- 
tions to the wavelet coefficients of a smooth function g from a sequence gj 
of discrete values gji = g{i2~^). Define the vector S jg to be the precondi- 
tioned sequence Pjgj. For any smooth function g, each 2~'^/'^ Sjkg is a good 
approximation to the scaling coefficient / gct)jk of g at level J. To be precise, 
provided g is R times continuous differentiable on [0, 1], we have for each k, 



(33) 



2-^/2 [\{t)(t>jk{t)dt 
Jo 



< c2--^^sup{|5(^)(x)| :xe supp((/>jfc)}. 



The result (33) depends on the vanishing moment properties of the scaling 
functions and on the construction of the preconditioning matrices. For full 
details, see Proposition 3 of [32]. 

5.5. The boundary- corrected empirical Bayes estimator. In this section 
we set out a detailed definition of a boundary corrected version of the empir- 
ical Bayes estimator, and prove that it has attractive theoretical properties. 
Assume throughout that a boundary corrected basis is in use. 

Assume that for N = 2"^ we have sufficient observations 

Xi = f{i/N) + Ei, £i independent 7V(0, 1) 

to evaluate the preconditioned sequence PjY . Let Y denote the boundary 
corrected discrete wavelet transform of N~^/'^PjX . 
Define the estimated coefficient array 6 as follows: 
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• Estimate the coarse scaling coefficients by their observed values, so set 

• Estimate the interior coefficients 9^ by applying the empirical Bayes method 
level- by-level to the observed array . 

• Threshold the boundary coefficients separately. At level j, use a hard 
threshold of taH /N)^/'^ , where t\ > 2(1 + A)c^log2, so that for each 
fcG/Cf 

6,k = YjkI[\Yjk>TA{j/N)^/^\]. 

• For unobserved levels j > L, set 9jk = ^- 

In our main theoretical discussion, we measure the risk of this estimate 
as an estimate of the wavelet expansion of the function itself by 

oo 

(34) W^^^^M = EWh-i - Ql-x \\\ + E - ^^\\% 

j=L 

If we use 6 as an estimate of the discrete wavelet transform array 0, then 
the natural measure of accuracy is the risk R as defined in (29). However, it 
should be noted that because of the preconditioning, the array 9 only spec- 
ifies uniquely the values of the sequence f{i/N) away from the boundaries. 

The main result of this section demonstrates that the estimate has optimal- 
rate risk behavior over q and p down to zero. 

Theorem 2. Assume that the scaling function (j) and the mother wavelet 
ip have R continuous derivatives and support [—S + 1, S] for some integer S , 
and that J x"^4'{x) dx = for m = 1, 2, . . . , i? — 1. Assume that the wavelets 
and scaling functions are modified by the boundary construction described 
above, and that the thresholding is carried out by a modified threshold method 
with ^ > 0. Assume that the available data and the construction of the esti- 
mator are as set out above. 

Assume that a> a, a> s, a < R, and sq < A. Assume either that a > ^ 
or that a = p = 1. Assume that < p < oo and < q < 2. Let r = (a — 
a)/{2a + l) and r' = {a — s) / {2a) . Let T{C) be the set of functions f whose 
wavelet coefficients fall in bp^{C). Then there is a constant c independent 
of C such that, for suitable r'" and suitable A and A', 

(35) sup R*j^.,if)<c{A{C,N) + Cm-'""'nog^' N + N-'^/^log^N}, 

where A{C,N) is as defined in (27) in Theorem 1, and, for all fixed C, 
N'^'^'ilog^' N is of smaller order than A{C,N). 
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The general correspondence between Besov sequence and function norms 
is discussed in Section 5.6. The case a = p = 1 is included because the space 
defined by membership of the Besov sequence space b\ ^{C) is well known 
to include the space of functions with appropriately bounded total variation. 

If our only concern is for the accuracy of estimation of the array 0, then 
we have the bound 

(36) sup RN,ciAf)<(^{^{C,N)+N~'^'Hog^N}. 

5.6. Besov array norms and function norms. Our theory gives minimax 
risk bounds over functions whose array of wavelet coefficients fall in bp^{C) 
Besov sequence balls. Under appropriate assumptions on the wavelet basis, 
these Besov sequence norms on the wavelet coefficients are equivalent to the 
corresponding Besov function norms on the functions themselves. Relevant 
results for the specific case of boundary-corrected wavelets on a bounded 
interval are considered in detail in Appendix D of [29] . The equivalence will 
certainly hold for the wavelets with bounded support and vanishing moments 
up to order R — 1, provided p>l,R>2,R>a and a > max(0, | — |)- See 
also [19, 20] and further literature referenced there. 

The i?p oo Besov function norm is not very easy to grasp intuitively, and 
for integers m it is helpful to compare the Sobolev space W^, which has 
norm minimax results hold over balls in Besov 

spaces; for p > 1, the standard result that is embedded in the space B^^^^ 
shows that the results will hold for minimax rates over balls in the Sobolev 
space W^. 

Turn now to the error measure. Suppose that / is a function with wavelet 
expansion given as in (32) above. Given any a >0 and < q<2, define 

oo 

r,,M) = \\(^L-i\\l + T.^'''\W^ 

where a = s — ^ + |- This corresponds to the risk measure (24) relative to 
which our theoretical bounds are obtained. We state and prove a proposi- 
tion showing that the error norm dominates various function seminorms. It 
follows from the proposition that, for a > and for q in (0,2], the bounds 
on estimation error proved for the Besov body error measure rq^„{f) hold a 
fortiori for error measured by the integrated qth power of the derivatives up 
to order cr, provided that the wavelet satisfies appropriate regularity con- 
ditions. Note that the lower bounds on a and q are zero, rather than the 
larger bounds on a and p required for full Besov norm equivalence. 
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Proposition 1. Suppose < a < R. For any integer r such that < 
r < a, and q in (0, 2], 

['\f^'Ht)\'dt<cr,,M)- 
Jo 

Remark. The one-sided bound leaves open the possibihty that the 
Sobolev norm might be much smaller than the Besov iJ^ ^ norm on the 
right-hand side. However, for 1 < g < 2 there is a reverse inequality, 

ci||/b,-_,<||/||w^ 

in terms of a Besov norm on the space 2 (defined, e.g., in [47]). Since 
the minimax rate of convergence is the same for the 2 and the B^ ^ error 
norms on regular and logarithmic zones [22], it can be said that we are 
capturing the situation for the Sobolev norm without too much loss. 

5.7. Comparisons. In this section we compare the theoretical results for 
empirical Bayes estimators established in Theorems 1 and 2 in this paper 
with those known for some other existing thresholds. 

The universal threshold aE\^2logN of [21] leads to rates of convergence 
that are suboptimal by logarithmic terms in the regular case and some crit- 
ical cases; see the detailed discussion in [22], Section 12.1. For example, 
using the notation of (26) and (27), the bound on the rate in the regular 
case ap > sq would be A{C,N) = C^^-'^''^''{{logN)/NY'' . The reason that 
the universal threshold is suboptimal in this way is essentially that, in dense 
cases, thresholds should be set at a bounded (and small) number of standard 
deviations, rather than being of order y/2TogN. 

For the SURE threshold recalled in Section 1.6, Donoho and Johnstone 
[19] and Johnstone [27] establish asymptotic optimality results under the 
special conditions of squared error loss (q = 2) for estimating the function 
{a = 0) over Besov bodies with p> 1. Since the SURE estimate chooses 
thresholds to optimize an unbiased estimate of mean squared error, it is 
possible with these restrictions to obtain not only optimal rates, but also 
to show that the limiting MSE is minimax optimal even at the level of 
constants among threshold estimators. By the same token, it is less clear 
that one could expect better than optimal rates for other loss functions (say 
q < 2) — and even the rate issue remains to be formally investigated. 

A more serious restriction of SURE is reflected in the constraint p > 1 . As 
is discussed in [3] and [33], and illustrated in Figure 5, the SURE criterion 
U{t) in (11) is far from smooth in t. The asymptotic oracle inequality for 
SURE in Theorem 4 of [19] contains an error term of crude order 
and this term has prevented any optimality conclusions from being drawn 
when p < 1. The instability shown in Figure 5 seems to derive from the 
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Fig. 5. Instability of the SURE criterion compared to stability of the MML criterion. 

Three replications for Zu N{^j,k, 1) tuith fik = 7 for k = 1:5, zero for k = 6: 1000. Top 
panels show the SURE criterion U{t) of (11) (solid ) and its expectation (dashed ). Bottom 
panels show the quasi-Cauchy score function (18) (solid) and its expectation (dashed ) as 
a function of the ( quasi- )threshold t{w) defined by solving l3(t,0) — 1/w. (Johnstone and 
Silverman [33] has more on the quasi-threshold. ) 



derivative discontinuity created by the threshold zone: similar plots were 
obtained when applying the SURE criterion to the posterior median rule to 
estimate thresholds. 

Related to this is another deficiency of the SURE threshold choice. While 
SURE adapts to squared error loss well on "dense" signals, the criterion does 
not reliably propose high thresholds for sparse signals. In order to obtain the 
theoretical results just cited, it was necessary to introduce a hybrid version 
of SURE containing a pretest for sparsity, which if detected, switched to a 
^J2 \ogn threshold. Thus, the hybrid version creates a grossly discontinuous 
transition in thresholds which, while sufficient for the theoretical result, is 
unattractive in practice. Indeed, the simulations in [33] found the hybrid 
modification to be counterproductive in the examples considered. 
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Turn now to the levelwise FDR as applied in the wavelet context. While 
there has been extensive analysis of the exact adaptive minimax optimality 
of FDR in the sparse single sequence model [3] over ip balls and losses for 
< p, g < 2, there has been no published analysis of rates of convergence for 
a levelwise FDR estimate in the wavelet shrinkage setting. In unpublished 
work, IMJ combined the optimality properties of FDR for sparse signals 
with the advantages of SURE for dense signals using an improved pretest 
for sparsity. Adaptive optimality of rates of convergence was obtained under 
the conditions of Theorem 1 in the case q = 2,a = Q. However, this work 
was abandoned in favor of the present empirical Bayes approach, due to the 
latter's smoother transition in threshold choice between sparse and dense 
regimes, reflected in better performance in actual examples. 

Birge and Massart [9] investigate a complexity-penalized model selection 
approach for Gaussian estimation and give a nonasymptotic risk bound for 
squared error loss (the case g' = 2, cr = here). When applied to our setting, 
their approach yields estimators that are minimax up to constants. The 
connection between the kind of "21og(n/A:) per parameter" penalties used by 
Birge and Massart [9] and Abramovich, Benjamini, Donoho and Johnstone 
[3] and FDR estimation is discussed further in the introduction to the latter 
paper. Finally, Paul [41] obtains optimal rates of convergence in certain 
inverse problems, again for g = 2,cj = 0, which in the direct estimation case 
would reduce to Theorem 1. 

There has also been recent work on the optimality of Bayesian wavelet 
estimators based on mixture priors, though not from the adaptive estimation 
perspective: see, for example, [1, 42]. 

6. Proofs of main results. The proofs of the main theorems make use of 
results of Johnstone and Silverman [33] for the maximum marginal likelihood 
procedure in the single sequence case (3). In Section 6.1 we review these 
results, in a form recast to be useful for the multilevel problems raised in the 
current paper. In Section 6.2 we begin by giving an intuitive overview of the 
proof of our main result. This demonstrates the way that different kinds of 
error bounds are needed for different levels of the array; implicit in the proof 
is the way that the level-dependent empirical Bayes approach automatically 
adapts between these. After our intuitive discussion, the formal proof of 
Theorem 1 is given. The proof of Theorem 2 follows in Section 6.3; this 
makes use of approximation properties of the boundary-corrected bases and 
preconditioning operators defined in Sections 5.3 and 5.4 above. 

6.1. Results for the single sequence problem. For a vector 9 G M", and 
< g < 2, suppose we have observations Z ~ Nn{9,e'^In)- Estimate 9 by 
applying the marginal maximum likelihood approach to the data e~^Z and 
then multiply the result by e to obtain an estimate of 9. Assume that we 
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(37) 4(n) 



are using a mixture prior (4) with 7 satisfying the assumptions set out in 
Section 2.1, and a famify of thresholding rules with the bounded shrinkage 
property. We may use a modified thresholding method with ^ > 0, as defined 
in (7). By convention, let ^ = denote the unmodified case. Define 

log2+('?-PA2)/2^^ if^ = 0, 

.n"^(logn)(5-i)/2, if^>0. 

Then, by making appropriate substitutions into Theorem 2 of [33], we 
obtain the following result. The rates of convergence achieved by the leading 
terms in the various bounds are the minimax rates for the various parameter 
classes considered; see [33] for more details. 

Theorem 3. Suppose that the above assumptions hold and that <p< 
00 and < q <2. Then the estimate 9 satisfies the following risk bounds. 

(a) (Robustness.) There exists a constant c such that 
(38) E\\9 - eW^g < cue" for all 0. 

(b) (Adaptivity for moderately sparse signals.) There exist constants c 
and r]o such that, for sufficiently large n, provided Cq < ne^rj^, setting £1 = 



(39) sup Ewe-ew^gK 



2{n^-(i/p)C^ + e'is\{n)}, ifp>q, 
.Tco""" """" \c{C^e'i~P + e'^s*An)}, ifp<q. 

(c) (Adaptivity for very sparse signals.) For q> p> and ^ > 0, we also 
have, for sufficiently large n, 

(40) sup E\\e -e\\l<c{Cl + e'is\{n)} /or Cq < e(logn)i/2. 

¥\\v<Co 

If q £ (1,2], these results also hold if the estimation is carried out with the 
posterior mean function for the weight estimated by the marginal maximum 
likelihood procedure. 

In order to get an intuitive understanding of the way this result will be 
used in the multilevel setting, focus attention on the case q > p and ignore 
the error term e'^s'^{n). The three results in Theorem 3 allow us to consider 
three zones of behavior of the underlying signal. 

The first zone has large signal-to-noise ratio Cq/e > n^^^rjo. Here the best 
bound we have is a risk of order ne'', corresponding to the global risk of the 
maximum likelihood estimator 0mle(-^) = Z. 

In the second zone, the signal-to-noise ratio is smaller and, since q > p, 
the risk can be substantially reduced by thresholding. A hard threshold 
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rule with threshold ei applied to the Z will typically make an error in any 
individual coordinate of at most ei . A least favorable configuration satisfying 
the constraint \\0\\p < Cq would occur with (Co/ei)^ coordinates of size (a 
little less than) £i each, and the rest being zero. This leads to a total error 
of order (CoM)^^!- 

In the third zone, the region described by (40), the signal-to-noise ratio 
is so small that there is no benefit to attempting estimation at all, and 
9q{Z) = is the natural estimator. This incurs risk ||0||^ < 116*11^ < Cq. 

The discussion for q<p \s similar and simpler, involving only two zones, 
in the first of which the performance of the estimator is similar to that of 
^MLE and in the second to the zero estimator ^o- The impact of the theorem 
is that the empirical Bayes estimator adaptively achieves, roughly speaking, 
the best possible behavior whichever zone the signal actually falls in, without 
having to specify p or g or Co in advance. 

We remark that, while Johnstone and Silverman [33] assumed < p < 2, 
we have subsequently checked that the results extend to p < oo and we use 
this broader range in this paper. 

6.2. Proof of Theorem 1. 

Heuristic introduction. Before the formal proof, we continue the intu- 
itive discussion in order to give a heuristic explanation of where the rates 
of convergence and the phase change in the proof come from. In addition, 
we can gain an understanding of which kinds of estimators and bounds are 
needed (and, indeed, are imitated by our empirical Bayes method) at which 
levels of the transform. The discussion is inspired by the "modulus of conti- 
nuity" point of view of Donoho, Johnstone, Kerkyacharian and Picard [22], 
but is adapted to the present setting. In the heuristic discussion, we ignore 
constants, error terms and so forth. 

We apply the bounds of Theorem 3 level by level, with noise level e = 
The multiresolution index structure and Besov body constraints im- 
ply that, at level j, we have n = Kj ^ 2^ and Co = 2~"'^C. In the heuristic 
discussion approximate ei by (log iV/A^) for simplicity, but in the actual 
proof this approximation is not used. With these substitutions, in the case 
q> p the zones of Theorem 3 translate to 

sup 2'''^E\\ej-eM 



(41) 
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/ N \ 

2-{a.V-sq)j (jpl ] (jjyl/22-(a+l/p)j <; ]_ 

\logNj 

«i < / \ ■ 

1 andC 2-'^J>l, 

/ \ 1/2 

2~{a~s)qj(jq Q I 2^"^ < 1. 

[ \logNj 

The first zone corresponds to the coarsest scales; the transition to the 
middle zone occurs at scale ji defined by 2(°"'"^/^)-'i = CN^''^, and the third 
bound applies at scales above the finer index j2 defined by 2"-^2 = C{N/ log N)^/"^. 
The risk bounds increase geometrically as j rises to ji and fall off geomet- 
rically as j increases above j2. The key to the behavior of the overall risk 
is ap — sq, because this determines the way the risk behaves in the zone 
between ji and j2- 

If ap > sq, then the least favorable index is ji, and with geometric decay 
of risks away from this level, the rate is determined by 2^*'^"^"^^-'^ A^"*^/^ = 

^l—2rj\j'—rq 

If ap < sq, the least favorable index is j2 and the rate is given by 2(*~")'^-^^C^ = 
(ji-2r (iQgjy/jsiy q^ Because of the extra logarithmic terms, this set of values 
of {a,a,p,q) is referred to as the "logarithmic zone." Compare Figure 6. 

If ap = sq, then each level between ji and j2 contributes (in our heuristic 
approximation) an amount equal to the maximum in the case ap < sq. There 
are O(logA^) such levels, leading to an extra logA^ factor. 

We now give the formal proof, following this overall strategy. 




■ ■ resolution 

■h J2 level 



Fig. 6. Schematic of risk contributions (41) by level j in the "logarith- 
mic" phase where ap < sq. In the setting Z ~ Nn{9,s^I„) of Section 6.1, set 
(C, n) = inf^ sup||g||^<c E\\§ - e\\l 
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Formal proof: division of the risk. We split the sum in (24) into parts cor- 
responding to the scahng coefficient, large-scale, fine-scale and very fine-scale 
parts of the risk. Define no and r/o so that (39) is satisfied. Let ji be the small- 
est integer for which j > max(L, logs no) and x 2-^'^+^/p'^^^+^'>CN^^^ < 
r/o- We will then be able to apply the bound (39) at levels j > ji- From the 
second property of ji , 

(42) 2-J'l < c(C^iV)-P/-t2(l+ap)} ^ ^^fj2jyyl/{2a+l) _ 

Also, ji must satisfy either 2^^ < no V 2^ or 2(°+i/2)ii < , so 

(43) 2^1 < cmax{l, (C2iV)i/(2a+i)|_ 
Now write 

(44) RN,gA^) = E\\9l-1 - Bl-iWI + + ^mid + RhU 

where 

ilA(J-l) 

iimid= E '^''"E\\0j-0j\\^ 
ji<j<J 
oo 

If C'^N < 1, then in each case the last term in (26) will dominate the 
first. Therefore, we can assume throughout, without loss of generality, that 
C > A^~^/^, since there is no point in proving the result for any smaller 
values of C. It then follows from (42) and (43) that 2^^ x (C2A^)i/(2a+i) ^ 

Scaling function risk. Let bq be the qth absolute moment of a standard 
normal random variable. Then 

(45) E\\9l^i - ^L_i||^ < KL_i6,iV-5/2 ^ 

Note that this holds regardless of the size of the scaling coefficients, and 
that the proof of Theorem 1 only uses the bound (25) for j > L. 

Risk at coarse scales. To bound i?io, use the result (38) and the property 
sq + l = q{a + ^) > > to give 

(46) Rio<cN-''/'^ J2 2(''«+i)j' <ciV-'^/22(i+«9)ji. 

L<j<ji 
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Therefore, since (1 + sq)/{2a + 1) = — r)q, 

(47) i?lo < ciV-'?/2(C2iV)(l/2-r)g < ^(j{l-2r)q^-rq_ 

In the case ap > sq, this is exactly of the magnitude required in (26). If 
ap < sq, because r >r' and C'^N > 1, the bound (47) wih be smaller than 
the first term in (26). 

Risk at fine scales. As noted previously, by the definition of ji, the bound 
(39) can be applied for the terms in the sum for Rmid- If Ji > — 1 there 
are no terms in the sum and -Rmid = 0. Otherwise, set 6 = {q — 1) /2 if A > 
and 2 + (g - p A 2)/2 if A = and define 

52 = iV-'?/2 ^ 2"'^s\{Kj) 

h<3<J 

(48) 

<iv-«/2 2-(^-"'?V<ciV"''/^(log^^)^ 

31<3<J 

in every case, using the definition (28) of v. 

Since at level j we have Cq = C2~"-', considering the two terms in (39) 
now yields 

(49) /2mid<c(5i+52), 
where 

'CI 2"^^2^^-'>M3 2-''ij, iip>q, 

ji<j<J 

(7PjV~(^~P)/^ Y 2~(''P~''^)-''{log(2(^°+^^-''A^~^C~^)}^''~''''/^, 
ji<j<J 

if q> p. 

We consider four cases for Si and show that in every case Si < cA(C, N). 
Combining with the bound (48) for S2 allows us to conclude, in all cases, 
that 

(50) i?mid < c{A{C,N)+N-'i/\logNr}. 

Case la, q < p- In this case we necessarily have ap > sq since a> s. If 
p>q, the exponent in the defining sum for is {sq + 1 — q/p — aq)j = 
— {a — cr)qj, and, hence, is geometrically decreasing. Therefore, 

(51) Si < cC'?2-("-'^)«i < cC'iC^N)-'-'' = cC^^-^^'^m-'^i. 

Case lb, q> p and ap > sq. The sum in Si is now geometrically decreas- 
ing apart from a log term, and so, using the property that 2^"^"^^^^^ x C'^N, 

Si < cCPA^-(^-P)/22-("P-^'^)ji{log{2(2"+^)ji AT-iC"^)}^^"^^/^ 

(52) 



Si 
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Case 2a, q> p and sq > ap. In this case, necessarily s > and, therefore, 
^ > since we require sq < A. Define j2 by 

2'^J2 =C7(iV/logiV)i/2_ 

We now spht the bounding sum for 5i into the two zones (ji, 72) and [j2, J). 
In the lower zone, since C'^N > 1, we can bound 

(53) log(2(2"+^)JA^-iC7-2)<(2a + l)j2log2<clogiV forj<j2. 

In the upper zone we use the bound (40). 

The two sums obtained are set out in the following display. Since the terms 
in the first sum are geometrically increasing, and in the second geometrically 
decreasing, both sums are dominated by a multiple of their value at j = j2 '■ 

(54) Si<cCP{N~HogN)''i-P^/^ 2("'?-"P)j' + cC'' ^ 2*''J2-'^« 

ji<j<mm{j2,J) 3>h 

< cC^ N"^''"'''^^'^ {log nY^~p^^'^2^'^^~'^^^^^ + cC"^2~^"~'^)-'^ 

(55) < cC^^-^'''^iN~'''''{\ogNY'i, 

after some algebra, substituting the definition of j2 . 

Case 2b, q> p and ap = sq. Argue as in Case 2a, but now the first sum 
in (54) is no longer geometric but is bounded by j2 < clogA'^. Carrying the 
extra logA^ factor through the argument yields 

(56) Si<cC'-^~'^'''^'^{N/logNy'''nogN = cA{C,N). 

Risk at very fine scales. Define A = (| — i)+,so that r" = a — s — A and, 
whatever the relative values of p and q, \\9j\\q < 2^-'||^j||p for each j. Then 

00 00 
= 2''^'ll%llg < E 2'''^2^« 11^^-11^ 

00 

3=J 

Conclusion of proof and consideration of related results. Combining the 
bounds (45), (47), (50) and (57) now completes the proof of Theorem 1. 

To obtain (30) subject to the constraints (23), we follow exactly the same 
argument, noting that there are no coefficients to estimate for j > J, and so 
there is no term corresponding to (57). 

To prove the result (31), where observations at all scales are available, 
modify the limits of summation where necessary. In the calculations for -Rmid > 
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the sums are extended to j = where appropriate. None of the bounds for 
is affected, but for S2 the calculation in (48) becomes 

ii<j<j2 

On the other hand, the sum in (57) now starts at j = J'^, leading to a bound 

Incorporating these two changes into the main argument leads to the result 
(31). 

6.3. Proof of Theorem 2. 

Remarks and preliminaries. In the estimation problem considered, the 
sequence Sjf is the vector of expected values of the preconditioned data 
PjX. Define 9 to be the boundary-corrected discrete wavelet transform of 
N~^^'^Sjf . Our procedure uses what is essentially an estimate of 9 to esti- 
mate the true coefficients 9. The conditions of Theorem 2 allow the difference 
between these two arrays to be bounded; by Proposition 4 of [32] we have 

(58) 2^^ \\9j- 9j\\p < cC2-"(^-j) for ah j with L - 1 < j < J, 

where a = a — (| — 1)+ > ^. An immediate corollary is that, for some fixed 
constant c, 

(59) \\9j\\p < cC2-''^ for L-l<j <J. 

Therefore, the "discretized" coefficient array 9 obeys (up to a constant) the 
same Besov sequence bounds as the "true" coefficient array 9. 

The precise value of the constant r'" in the theorem is min{a — s,a — 
a, a} = min{r", a}, with A' = 1 if a = min(a — s,a — a) and otherwise. We 
have already noted in the remarks following Theorem 1 that N^'^ is always 
of lower order than A(C, N) for fixed C. The same is true of iV-'^"''?log^'Ar 
since a > | and the log term can only be present if r'" > ^. 

Main component of error. Use the convention that / refers to the interior 
coefficients and B to the boundary coefficients. The l^-fc each have expected 
value 9jk, and for the interior coefficients are independent normals with 
variance 1. Because of the bound (59), we can argue exactly as in Theorem 
1 to obtain 

J-i 

2"^^E\\9j - 9j\\l + J2 2"'^'ll^illg 
, , j=L j>J 

(60) 

< c{A{C, N) + cm-''"'' + log^ N}, 
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where v is as defined in (28). 

Equation (60) gives the main part of the risk bound in Theorem 2, and 
the remainder of the proof consists in controlhng all the other contributing 
errors. 

Coarse scale error. Consider first the coarse level scaling coefficients 
Ol^i. Since the variance of each element of 9l-i = Yl-i is bounded by 
N^'^CA we have 

(61) EpL-i - h-iWl < cN-''/^cf2^ < cN-'i/^. 

Boundary coefficients. The contribution of the estimates of the boundary 
coefficients is considered in the following proposition. 

Proposition 2. Under the assumptions of Theorem 2, uniformly over 
T{C) as J ^ oo, 
J-i 

(62) '^'"^Epf - ~ef\\l < c{(:7(i-2^')«(iV/logiV)-'-'9 + iV-'?/2}. 

Proof. Define -Rboundary to be the sum on the left-hand side of (62). The 
array 9^ has the same number of coefficients at every level j > L, and the 
elements of the array are normally distributed with expected values 6^ , 
and variances bounded by caN~^. We obtain the 9?f, by individually hard 

thresholding the Y^^ with threshold TA{j /N)^/'^ , so by standard properties 
of g-norms and thresholding, 

E\e,k - e.kl' < c{E\9jk - Yjk\' + E\Yjk - ^jfcD 

(63) < c(j«/22-5-^/2 + 2-9^/2) 

If s < 0, use the bound (63) to give 

(64) i?bou„dary < c2-"^/^ f^''^''" < ciV'^'- 

For s > 0, define j2 by 2"J2 = C(iV/ log iV)i/2 gpj^^ ^j^g j-jgj^ ^^^^ 
two parts. Letting Rio be the risk for boundary coefficients at levels below 
min(j2, J), arguing as in (64), 

i2A{J-i) 
Rio < cN'i/^ j'/^2^« 

(65) <c(7V/logiV)-«/22^'?i2 

<cC(^-2'^')<?(Ar/iogAr)-^'9. 
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Let i?hi be the contribution to -Rboundary from levels j > j2. For j > j2, by 
Proposition 1 of [33], taking account of the bound ca on the variance of the 

(66) 

< c{|^jfc|« +i(''-i)/2Ar-'//22-{i+A)i|_ 

Substituting the bound (66) and using the property sq< A gives 

(67) i?hi<c 2J'^«||^f ||^ + ciV-'?/2 ^ j('?-i)/22-i. 

32<i<J 3>32 

Since the vector 9^ is the same length for all j, for some constant c 

independent of j we have H^^Hq < < cC2~"-' by the bound (59). 

Therefore, 

j>j2 

(68) < cC"?2-("-")5J2 + ciV-<?/2 

< cC(^-2^')«iV-^''? log^'^ N + cN-i/^. 

To complete the proof, combine the bounds in (64), (65) and (68). □ 

Discretization bias. The risks (60), (61) and (62) all quantify errors 
around the discretized coefficients 9. To control the difference in the risk 
norm between these coefficients and the true coefficients 9, define A = 
(i — -)+ as in the proof of (57). Using the bound (58) and the property 
r" = a — s — A, it follows that 

J-i J-i 

o(s+A)gj||, 



2^<ij\\§. -0.\\'i< 2(^+^)«J'llfl^- - «^H'' 



j=L-l j=L-l 

(69) 

j=L~l 

If a < r" , the expression is bounded by cC^J^'2~°"''^ since A' = 1 if and 
only if a = r" . On the other hand, if a > r" , the sum in (69) is geo- 
metrically increasing, and so the expression is of order cC"^2~'" Since 
r'" = min(r",a!), all cases are combined in the bound 

J-i 

(70) '^'"^W^j - ^jWl ^ cC''N''"''J{\ogNf. 

j=L-l 
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Completing the proof. To complete the proof of Theorem 2, we combine 
the bounds (60), (61), (62) and (70). For r' < r, this gives the required result, 
with X = u. 

For r' > r, we have an additional term from (62), proportional to C^^"^^ ^'^ x 
(N/logN)~^ ^. By elementary manipulation, we have 

^(l-2r')(^/log^)-r' 

r (:7{l-2r)^-r^ -f (C2^)r'-r > (logiY)^'', 

- I Ar-l/2(logAr)(l/2y{l-2r)/{r'-r)^ (C^iV)-'-- < (logiV)^''. 

It follows that the bound (35) holds for this case also, setting A = max{z^, ^qr' x 
(1 — 2r)/(r' — r)}. This completes the proof of Theorem 2. 

The corresponding theorem with periodic boundary conditions on the 
functions and the wavelet decompositions is also true, and can be proved by 
a simplified version of the same approach, without any need for precondi- 
tioning or for the consideration of boundary coefficients. 

To prove the result (36), we use exactly the same argument as above, but 
there is no need to include the discretization error or the error due to levels 
of the transform with j > J. 

7. Further proofs and remarks. 

7.1. Proof of Proposition 1. Note first that, for q> 1 and values of the 
other parameters such that rg^o- can be shown to be equivalent to the cor- 
responding function Besov norm, the result follows from the embedding of 
the Besov space B^''' in the Sobolev space consequent on results in Sec- 
tion 3.2 of [47]. However, to deal explicitly with all parameter values and 
with our boundary construction, we give an argument that does not use this 
embedding. 

If g < 1, the function ipj^j^ has support of length at most (25* — 1)2"-^/^ 
and maximum absolute value bounded by dl^l'^T'K Therefore, for all j and 

J iV'jfcV < c2~^'/22-'9/22''J''? < c2'^'^. 
A direct calculation using the property |x + y|'^<|a;|'' + |y|'^ now shows that 

['\f^^\t)\Ut 
Jo 

rl °° 2^—1 „x 

fcG/CL-i ° j=L k=0 •'^ 

(71) 
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<c\\eL-i\\l + cY^r^ne,\\i 

For 1 < q < 2, we consider the interior and boundary cofficients of / sep- 
arately. Define 

oo oo 



We have, first, 

nl 







\fi:\t)\ut<snp\fr{t)\ 



(72) 



* \fcG/Cz,_i / 



<c\\eL-i\\l. 



Now consider //. Let Xjk{x) be the indicator function of the interval 
[2-^k,2-i{k + 1)]. By Theorem 2 of Chapter 6 of [37], using the fact that 



ia<l, 



\fi'\t)\Ut<c 



J2 E 2^2'^'0%xjkix) 



j=LkeJC'j 



q/2 



dx 



(73) 



oo „i 



oo 
j=L 

Finally, we consider the contribution from the boundary wavelets. Let Sj 
be the union of the supports of the boundary wavelets at level j. Define 
Tj = Sj \ Sj+i and aj = 2('^+V2)i y^. Then 



(74) 



^^'^1 keicf 
<c\\e,\U2(^+'/''>n[S,]=ca,I{S,]. 



fee/cf 



(75) 

= c2 
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Using (74), the nesting properties of the Sj, and the property that \Ti\ = 
c2~^, we obtain, applying Holder's inequality, 

Coo \ Q / \ t 

Y.a,I[sA =cj [Y^aA 
j=L / -^^^ \j=L / 

/ i \1 i 
\j=L / j=L 

Using the bound (75) for each term, we can now conclude that 

1 CXD „ 

\fPit)\'dt=Y: / \f^B\trdt 

£=L •''^t 

oo C f. ~\ 

£=L I j=L ) 



(76) 



j=L \ e=j / 

oo oo 
j=L j=L 

Using the bounds (72), (73) and (76), we can conclude that 

completing the proof. 

7.2. Howmuch is lost without bounded shrinkage? The Besov space adap- 
tivity result Theorem 1 provides a context within which the importance of 
the bounded shrinkage property can be assessed. Construct a coefficient ar- 
ray 9 by setting ^3^0 = C2~^" and all other 6jk = 0. Then for any p, p and 
a, 9 will be a member of the Besov sequence space bp p{C). (Obviously the 
choice of level 3 to have a single nonzero element is arbitrary, and any other 
fixed position can be used.) 

Let Z = N^/'^Y^fi and fi = N'^/'^9^fi. Use the mixture prior (1 — w)5o + w"f 
for setting 7 to be the normal A^(0, r^) density. Whatever the value of w, 
the posterior median function then has the property 

|/i(z, w)\ < \t\z\ where \r = r^/ (1 + r^), 
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and the same inequality holds for the posterior mean. Since > 0, whether 
w is a fixed or random weight, we have 

(77) Eift - iif > E{fi - fifl[Z < ;u] > i(l - A,)V'- 

Multiplying both sides of (77) by shows that the the mean square error 
risk satisfies 

RnM^) > E{e^fl - ^3,o)' > \{i - Kfelfi = \{i - \r?2~^''c\ 

which does not tend to zero as N ^ oo. Hence, the maximum risk over any 
Besov sequence class does not diminish as N increases, and no adaptivity 
result of the type given in Theorem 1 can be proved. 

In the case where 7 has tails asymptotic to exp(— c|t|^) for some A G (1, 2), 
it can be shown that, at least for large fi, |/i — //| > c/i'*'"^. Consideration of 
the same counterexample as above then demonstrates that, again, however 
the weight is chosen, the maximum risk over the Besov sequence space is 
bounded below by a multiple of N^^'^^. For large a this will dominate the 
rate in Theorem 1, and thus the assumption that 7 has tails at least as heavy 
as exponential cannot essentially be relaxed without restricting or removing 
the adaptivity demonstrated by the theorem. 

7.3. Results for the posterior mean. If the estimation is conducted using 
the posterior mean rather than a strict thresholding rule, the results of 
Theorems 1 and 2 still hold for 1 < g < 2, since the bounds of Theorem 
3 apply in this case. For smaller values of q in the single sequence case, 
Johnstone and Silverman [33] show that the failure of the posterior mean 
to be a strict thresholding rule has a substantive effect on the overall risk. 
However, their counterexample does not unequivocally settle the question 
of the behavior of the posterior mean estimator in the wavelet case. The 
possible extension or modification of the theorems for the posterior mean 
estimator for g < 1 is a topic for future investigation. 
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